AWS provides storage, compute and other services around the world, in regions.
Data in S3 is stored in buckets; each bucket is in a single region.
There are some “special” regions: China, AWS GovCloud. It is believed that the S3A connector works in these places, at least to the extent that nobody has complained about it not working.
The S3A connector connects to Amazon S3 storage over HTTPS connections, either directly or through an HTTP proxy. HTTP HEAD and GET, PUT, POST and DELETE requests are invoked to perform different read/write operations against the store.
There are multiple ways to connect to an S3 bucket
The S3A connector supports all these; S3 Endpoints are the primary mechanism used, either explicitly declared or automatically determined from the declared region of the bucket.
Not supported:
* AWS Snowball.
As of December 2023, AWS S3 uses Transport Layer Security (TLS) version 1.2 to secure the communications channel; the S3A client does this through the Apache HttpClient library.
Third-party stores implementing the S3 API are also supported. These often only implement a subset of the S3 API; not all features are available. If TLS authentication is used, then the HTTPS certificates for the private stores MUST be installed on the JVMs on hosts within the Hadoop cluster.
See Working with Third-party S3 Stores after reading this document.
There are three core settings to connect to an S3 store: endpoint, region and whether or not to use path style access.
<property>
  <name>fs.s3a.endpoint</name>
  <description>AWS S3 endpoint to connect to. An up-to-date list is
    provided in the AWS Documentation: regions and endpoints. Without this
    property, the endpoint/hostname of the S3 Store is inferred from
    the value of fs.s3a.endpoint.region, fs.s3a.endpoint.fips and more.
  </description>
</property>

<property>
  <name>fs.s3a.endpoint.region</name>
  <value>REGION</value>
  <description>AWS Region of the data</description>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>false</value>
  <description>Enable S3 path style access by disabling the default virtual hosting behaviour.
    Needed for AWS PrivateLink, S3 AccessPoints, and, generally, third party stores.
    Default: false.
  </description>
</property>
Historically the S3A connector has preferred the endpoint as defined by the option fs.s3a.endpoint. With the move to the AWS V2 SDK, there is more emphasis on the region, set by the fs.s3a.endpoint.region option.

Normally, declaring the region in fs.s3a.endpoint.region should be sufficient to set up the network connection to correctly connect to an AWS-hosted S3 store.
* fs.s3a.endpoint and fs.s3a.endpoint.region are used to set values for S3 endpoint and region respectively.
* If fs.s3a.endpoint.region is configured with a valid AWS region value, S3A will configure the S3 client to use this value. If this is set to a region that does not match your bucket, you will receive a 301 redirect response.
* If fs.s3a.endpoint.region is not set and fs.s3a.endpoint is set with a valid endpoint value, S3A will attempt to parse the region from the endpoint and configure the S3 client to use the region value.
* If fs.s3a.endpoint and fs.s3a.endpoint.region are not set, S3A will use us-east-2 as the default region and enable cross region access. In this case, S3A does not attempt to override the endpoint while configuring the S3 client.
* If fs.s3a.endpoint is not set and fs.s3a.endpoint.region is set to an empty string, S3A will configure the S3 client without any region or endpoint override. This will allow fallback to the S3 SDK region resolution chain. More details here.
* If fs.s3a.endpoint is set to the central endpoint s3.amazonaws.com and fs.s3a.endpoint.region is not set, S3A will use us-east-2 as the default region and enable cross region access. In this case, S3A does not attempt to override the endpoint while configuring the S3 client.
* If fs.s3a.endpoint is set to the central endpoint s3.amazonaws.com and fs.s3a.endpoint.region is also set to some region, S3A will use that region value and enable cross region access. In this case, S3A does not attempt to override the endpoint while configuring the S3 client.

When cross region access is enabled while configuring the S3 client, the S3 SDK determines the region even if the region set is incorrect. This is done by making the request, and if the SDK receives a 301 redirect response, it determines the region at the cost of a HEAD request, and caches it.
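As an illustration of the case where fs.s3a.endpoint is the central endpoint and a region is also declared, here is a minimal sketch scoped to a single bucket. The bucket name example-bucket and the region eu-west-1 are placeholders, not values taken from this document; the per-bucket fs.s3a.bucket.NAME. prefix follows the pattern used in the other examples on this page.

```xml
<!-- Hypothetical bucket "example-bucket": central endpoint plus a declared region. -->
<!-- S3A is expected to use the declared region and enable cross region access. -->
<property>
  <name>fs.s3a.bucket.example-bucket.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

<property>
  <name>fs.s3a.bucket.example-bucket.endpoint.region</name>
  <value>eu-west-1</value>
</property>
```

If the declared region turns out to be wrong, the SDK should still locate the bucket through the redirect handling described here, at the cost of an extra request.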
Please note that some endpoint and region settings that require cross region access are complex and improving over time. Hence, they may be considered unstable.
If you are working with third party stores, please check third party stores in detail.
See Timeouts.
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>200</value>
  <description>Controls the maximum number of simultaneous connections to S3.
    This must be bigger than the value of fs.s3a.threads.max so as to stop
    threads being blocked waiting for new HTTPS connections.
  </description>
</property>

<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true</value>
  <description>
    Enables or disables SSL connections to AWS services.
  </description>
</property>

<property>
  <name>fs.s3a.ssl.channel.mode</name>
  <value>Default_JSSE</value>
  <description>
    TLS implementation and cipher options.
    Values: OpenSSL, Default, Default_JSSE, Default_JSSE_with_GCM.

    Default_JSSE is not truly the default JSSE implementation because
    the GCM cipher is disabled when running on Java 8. However, the name
    was not changed in order to preserve backwards compatibility. Instead,
    a new mode called Default_JSSE_with_GCM delegates to the default JSSE
    implementation with no changes to the list of enabled ciphers.

    OpenSSL requires the wildfly JAR on the classpath and a compatible
    installation of the openssl binaries.
    It is often faster than the JVM libraries, but also trickier to use.
  </description>
</property>

<property>
  <name>fs.s3a.socket.send.buffer</name>
  <value>8192</value>
  <description>
    Socket send buffer hint to amazon connector. Represented in bytes.
  </description>
</property>

<property>
  <name>fs.s3a.socket.recv.buffer</name>
  <value>8192</value>
  <description>
    Socket receive buffer hint to amazon connector. Represented in bytes.
  </description>
</property>
Connections to S3A stores can be made through an HTTP or HTTPS proxy.
<property>
  <name>fs.s3a.proxy.host</name>
  <description>
    Hostname of the (optional) proxy server for S3 connections.
  </description>
</property>

<property>
  <name>fs.s3a.proxy.ssl.enabled</name>
  <value>false</value>
  <description>
    Does the proxy use a TLS connection?
  </description>
</property>

<property>
  <name>fs.s3a.proxy.port</name>
  <description>
    Proxy server port. If this property is not set
    but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
    the value of fs.s3a.connection.ssl.enabled).
  </description>
</property>

<property>
  <name>fs.s3a.proxy.username</name>
  <description>Username for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.password</name>
  <description>Password for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.domain</name>
  <description>Domain for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.s3a.proxy.workstation</name>
  <description>Workstation for authenticating with proxy server.</description>
</property>
Sometimes the proxy can be a source of problems, especially if HTTP connections are kept in the connection pool for some time. Experiment with the values of fs.s3a.connection.ttl and fs.s3a.connection.request.timeout if long-lived connections have problems.
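A minimal sketch of such an experiment, assuming a hypothetical proxy at proxy.example.org:8080. The timeout values are illustrative starting points rather than recommendations, and the time-unit suffixes assume a Hadoop release which accepts them; check the Timeouts documentation for the format and defaults in your release.

```xml
<!-- Hypothetical proxy host and port -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.org</value>
</property>

<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>

<!-- Illustrative value: retire pooled connections sooner so stale ones are less likely to be reused -->
<property>
  <name>fs.s3a.connection.ttl</name>
  <value>1m</value>
</property>

<!-- Illustrative value for the request timeout -->
<property>
  <name>fs.s3a.connection.request.timeout</name>
  <value>30s</value>
</property>
```

Changing one value at a time makes it easier to see which setting affects the proxy behaviour.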
S3 Buckets are hosted in different “regions”, the default being “US-East-1”. The S3A client talks to this region by default, issuing HTTP requests to the server s3.amazonaws.com.
S3A can work with buckets from any region. Each region has its own S3 endpoint, documented by Amazon.
fs.s3a.endpoint.region, or by explicitly setting fs.s3a.endpoint and fs.s3a.endpoint.region.
us-east-1 endpoint allows access to the data, but newer storage types, particularly S3 Express, are not supported.

If the wrong endpoint is used, the request will fail. This may be reported as a 301/redirect error, or as a 400 Bad Request: take these as cues to check the endpoint setting of a bucket.
The up-to-date list of regions is available online.
This list can be used to specify the endpoint of individual buckets, for example for buckets in the us-west-2 and EU/Ireland endpoints.
<property>
  <name>fs.s3a.bucket.us-west-2-dataset.endpoint.region</name>
  <value>us-west-2</value>
</property>

<property>
  <name>fs.s3a.bucket.eu-dataset.endpoint.region</name>
  <value>eu-west-1</value>
</property>
AWS PrivateLink for Amazon S3 allows for a private connection to a bucket to be defined, with network access rules managing how a bucket can be accessed.
vpce-f264a96c-6d27bfa7c85e.s3.us-west-2.vpce.amazonaws.com
vpce URL.

<property>
  <name>fs.s3a.bucket.example-usw2.endpoint</name>
  <value>https://bucket.vpce-f264a96c-6d27bfa7c85e.s3.us-west-2.vpce.amazonaws.com/</value>
</property>

<property>
  <name>fs.s3a.bucket.example-usw2.path.style.access</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.bucket.example-usw2.endpoint.region</name>
  <value>us-west-2</value>
</property>
It is possible to use FIPS-compliant endpoints which support a restricted subset of TLS algorithms.
Amazon provide a specific set of FIPS endpoints to use so callers can be confident that the network communication is compliant with the standard: non-compliant algorithms are unavailable.
The boolean option fs.s3a.endpoint.fips (default false) switches the S3A connector to using the FIPS endpoint of a region.
<property>
  <name>fs.s3a.endpoint.fips</name>
  <value>true</value>
  <description>Use the FIPS endpoint</description>
</property>

For a single bucket:

<property>
  <name>fs.s3a.bucket.noaa-isd-pds.endpoint.fips</name>
  <value>true</value>
  <description>Use the FIPS endpoint for the NOAA dataset</description>
</property>
If fs.s3a.endpoint.fips is true, the endpoint option fs.s3a.endpoint MUST NOT be set to any non-central endpoint value. The only value then permitted for fs.s3a.endpoint, and it is optional, is the central endpoint s3.amazonaws.com.
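A minimal sketch of a configuration consistent with this constraint: FIPS is enabled, the endpoint is set to the central endpoint (it could equally be left unset), and the region is declared explicitly. The region us-west-2 is only an example; whether a FIPS endpoint exists for a given region has to be checked against the AWS documentation.

```xml
<property>
  <name>fs.s3a.endpoint.fips</name>
  <value>true</value>
</property>

<!-- Optional: with FIPS enabled, only the central endpoint may be declared here -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>

<!-- Example region; assumed to offer a FIPS endpoint -->
<property>
  <name>fs.s3a.endpoint.region</name>
  <value>us-west-2</value>
</property>
```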
S3A error message if the s3.eu-west-2.amazonaws.com endpoint is used with FIPS:
Non central endpoint cannot be set when fs.s3a.endpoint.fips is true : https://s3.eu-west-2.amazonaws.com
S3A validation is used to fail fast before the SDK returns an error.
AWS SDK error message if S3A does not fail-fast:
A custom endpoint cannot be combined with FIPS: https://s3.eu-west-2.amazonaws.com
The SDK calculates the FIPS-specific endpoint without any awareness as to whether FIPS is supported by a region. The first attempt to interact with the service will fail:
java.net.UnknownHostException: software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.: example-london-1.s3-fips.eu-west-2.amazonaws.com
For more details on endpoint and region settings, please check S3 endpoint and region settings in detail.
Important OpenSSL and FIPS endpoints
Linux distributions with a FIPS-compliant SSL library may not be compatible with wildfly. Always use the JDK SSL implementation unless you are confident that the library is compatible, or wish to experiment with the settings outside of production deployments.
<property>
  <name>fs.s3a.ssl.channel.mode</name>
  <value>Default_JSSE</value>
</property>
S3A supports S3 Access Point usage which improves VPC integration with S3 and simplifies your data’s permission model because different policies can be applied now on the Access Point level. For more information about why to use and how to create them make sure to read the official documentation.
Accessing data through an Access Point is done by using its ARN, as opposed to just the bucket name. You can set the Access Point ARN using the following per-bucket configuration property:
<property>
  <name>fs.s3a.bucket.sample-bucket.accesspoint.arn</name>
  <value> {ACCESSPOINT_ARN_HERE} </value>
  <description>Configure S3a traffic to use this AccessPoint</description>
</property>
This configures access to the sample-bucket bucket for S3A to go through the new Access Point ARN. So, for example, s3a://sample-bucket/key will now use your configured ARN when getting data from S3 instead of your bucket.

The name of the bucket used in the s3a:// URLs is irrelevant; it is not used when connecting with the store.
Example
<property>
  <name>fs.s3a.bucket.example-ap.accesspoint.arn</name>
  <value>arn:aws:s3:eu-west-2:152813717728:accesspoint/ap-example-london</value>
  <description>AccessPoint bound to bucket name example-ap</description>
</property>
The fs.s3a.accesspoint.required property can also require all access to S3 to go through Access Points. This has the advantage of increasing security inside a VPN / VPC as you only allow access to known sources of data defined through Access Points. If there is a need to access a bucket directly (without Access Points), you can use per-bucket overrides to disable this setting on a bucket-by-bucket basis, i.e. fs.s3a.bucket.{YOUR-BUCKET}.accesspoint.required.
<!-- Require access point only access -->
<property>
  <name>fs.s3a.accesspoint.required</name>
  <value>true</value>
</property>

<!-- Disable it on a per-bucket basis if needed -->
<property>
  <name>fs.s3a.bucket.example-bucket.accesspoint.required</name>
  <value>false</value>
</property>
Before using Access Points make sure you’re not impacted by the following:
- The endpoint for S3 requests will automatically change to use s3-accesspoint.REGION.amazonaws.{com | com.cn} depending on the Access Point ARN. While considering endpoints, if you have any custom signers that use the host endpoint property make sure to update them if needed;
The storediag command within the utility cloudstore JAR is recommended as the way to view and print settings.

If storediag doesn’t connect to your S3 store, nothing else will.
Based on the experience of people who field support calls, here are some of the main connectivity issues which cause problems.
All hosts in the cluster need to have the configuration secrets; local environment variables are not enough.
If HTTPS/TLS is used for a private store, the relevant certificates MUST be installed everywhere.
For applications such as distcp, the options need to be passed with the job.
If your cluster is configured to use a private store, AWS-hosted buckets are not visible. If you wish to access data in a private store, you need to change the endpoint.
Private S3 stores generally expect path style access.
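A minimal sketch of the settings a private store typically needs, scoped to one bucket. The bucket name example-store and the hostname storage.example.org are hypothetical; any further options a third-party store needs are store-specific and not shown.

```xml
<!-- Hypothetical private store reached at storage.example.org -->
<property>
  <name>fs.s3a.bucket.example-store.endpoint</name>
  <value>https://storage.example.org/</value>
</property>

<!-- Private stores generally expect path style access -->
<property>
  <name>fs.s3a.bucket.example-store.path.style.access</name>
  <value>true</value>
</property>
```

Scoping the options to a bucket with the fs.s3a.bucket.NAME. prefix leaves the cluster-wide endpoint unchanged, so other buckets keep using the default settings.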
These usually surface rapidly and with meaningful messages.
Region errors generally surface as
* UnknownHostException
* AWSRedirectException “Received permanent redirect response to region”
Endpoint configuration problems can be more varied, as they are just HTTPS URLs.
When it works, it is fast. But it is fussy as to OpenSSL implementations, TLS protocols and more. Because it uses the native openssl binaries, operating system updates can trigger regressions.
Disabling it should be the first step to troubleshooting any TLS problems.
If there is a proxy, set it up correctly.