NOTE: Hadoop’s s3: and s3n: connectors have been removed. Please use s3a: as the connector to data hosted in S3 with Apache Hadoop.
Consult the s3n documentation for migration instructions.
Apache Hadoop’s hadoop-aws module provides support for AWS integration, enabling applications to easily use this support.
To include the S3A client in Apache Hadoop’s default classpath:
Make sure that HADOOP_OPTIONAL_TOOLS in hadoop-env.sh includes hadoop-aws in its list of optional modules to add to the classpath.
For client side interaction, you can declare that relevant JARs must be loaded in your ~/.hadooprc file:
hadoop_add_to_classpath_tools hadoop-aws
The settings in this file do not propagate to deployed applications, but they will work for local clients such as the hadoop fs command.
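As an illustration, the two settings can look like the following, assuming hadoop-aws is the only optional tool needed; the final hadoop fs check against the public landsat-pds bucket (using the anonymous credential provider described later in this document) is just one way to confirm the S3A classes are on the classpath:

# hadoop-env.sh (server side): add hadoop-aws to the optional tools
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

# ~/.hadooprc (client side): pull the hadoop-aws JARs onto the local classpath
hadoop_add_to_classpath_tools hadoop-aws

# quick check that s3a:// URLs resolve; credential setup is covered below
hadoop fs -ls \
 -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
 s3a://landsat-pds/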
Hadoop’s “S3A” client offers high-performance IO against Amazon S3 object store and compatible implementations.
There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.
S3A depends upon two JARs, alongside hadoop-common and its dependencies.
The versions of hadoop-common and hadoop-aws must be identical.
To import the libraries into a Maven build, add hadoop-aws JAR to the build dependencies; it will pull in a compatible aws-sdk JAR.
The hadoop-aws JAR does not declare any dependencies other than those unique to it, such as the AWS SDK JAR. This simplifies excluding/tuning Hadoop dependency JARs in downstream applications. The hadoop-client or hadoop-common dependency must be declared.
<properties>
  <!-- Your exact Hadoop version here-->
  <hadoop.version>3.0.0</hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
Amazon S3 is an example of “an object store”. In order to achieve scalability and especially high availability, S3 has —as many other cloud object stores have done— relaxed some of the constraints which classic “POSIX” filesystems promise.
The S3Guard feature attempts to address some of these, but it cannot do so completely. Do read these warnings and consider how they apply.
For further discussion on these topics, please consult The Hadoop FileSystem API Definition.
Specifically
The S3A client mimics directories by:
Here are some of the consequences:
The object authorization model of S3 is much different from the file authorization model of HDFS and traditional file systems. The S3A client simply reports stub information from APIs that would query this metadata:
S3A does not really enforce any authorization checks on these stub permissions. Users authenticate to an S3 bucket using AWS credentials. It’s possible that object ACLs have been defined to enforce authorization at the S3 side, but this happens entirely within the S3 service, not within the S3A implementation.
Your AWS credentials not only pay for services, they offer read and write access to the data. Anyone with the credentials can not only read your datasets —they can delete them.
Do not inadvertently share these credentials through means such as
If you do any of these: change your credentials immediately!
On Amazon EMR s3a:// URLs are not supported; Amazon provide their own filesystem client, s3://. If you are using Amazon EMR, follow their instructions for use —and be aware that all issues related to S3 integration in EMR can only be addressed by Amazon themselves: please raise your issues with them.
Equally importantly: much of this document does not apply to the EMR s3:// client. Please consult the EMR storage documentation instead.
Except when interacting with public S3 buckets, the S3A client requires credentials to interact with buckets.
The client supports multiple authentication mechanisms and can be configured as to which mechanisms to use, and their order of use. Custom implementations of com.amazonaws.auth.AWSCredentialsProvider may also be used.
<property> <name>fs.s3a.access.key</name> <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description> </property> <property> <name>fs.s3a.secret.key</name> <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description> </property> <property> <name>fs.s3a.aws.credentials.provider</name> <description> Comma-separated class names of credential provider classes which implement com.amazonaws.auth.AWSCredentialsProvider. These are loaded and queried in sequence for a valid set of credentials. Each listed class must implement one of the following means of construction, which are attempted in order: 1. a public constructor accepting java.net.URI and org.apache.hadoop.conf.Configuration, 2. a public static method named getInstance that accepts no arguments and returns an instance of com.amazonaws.auth.AWSCredentialsProvider, or 3. a public default constructor. Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible S3 bucket without any credentials. Please note that allowing anonymous access to an S3 bucket compromises security and therefore is unsuitable for most use cases. It can be useful for accessing public data sets without requiring AWS credentials. If unspecified, then the default list of credential provider classes, queried in sequence, is: 1. org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider: supports static configuration of AWS access key ID and secret access key. See also fs.s3a.access.key and fs.s3a.secret.key. 2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports configuration of AWS access key ID and secret access key in environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK. 3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use of instance profile credentials if running in an EC2 VM. </description> </property> <property> <name>fs.s3a.session.token</name> <description> Session token, when using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as one of the providers. </description> </property>
S3A supports configuration via the standard AWS environment variables.
The core environment variables are for the access key and associated secret:
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
If the environment variable AWS_SESSION_TOKEN is set, session authentication using “Temporary Security Credentials” is enabled; the Key ID and secret key must be set to the credentials for that specific session.
export AWS_SESSION_TOKEN=SECRET-SESSION-TOKEN
export AWS_ACCESS_KEY_ID=SESSION-ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SESSION-SECRET-KEY
These environment variables can be used to set the authentication credentials instead of properties in the Hadoop configuration.
Important: These environment variables are generally not propagated from client to server when YARN applications are launched. That is: having the AWS environment variables set when an application is launched will not permit the launched application to access S3 resources. The environment variables must (somehow) be set on the hosts/processes where the work is executed.
The standard way to authenticate is with an access key and secret key using the properties in the configuration file.
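For example, a minimal core-site.xml fragment might look like this, with placeholder values standing in for real credentials:

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR-ACCESS-KEY-ID</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR-SECRET-KEY</value>
</property>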
The S3A client follows this authentication chain:
1. The fs.s3a.access.key and fs.s3a.secret.key are looked for in the Hadoop XML configuration.
2. The AWS environment variables are then looked for.
3. An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the com.amazonaws.auth.AWSCredentialsProvider Interface. This is done by listing the implementation classes, in order of preference, in the configuration option fs.s3a.aws.credentials.provider.
Important: AWS Credential Providers are distinct from Hadoop Credential Providers. As will be covered later, Hadoop Credential Providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS Credential Providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties and configuration files.
There are three AWS Credential Providers inside the hadoop-aws JAR:
classname | description |
---|---|
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | Session Credentials |
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider | Simple name/secret credentials |
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider | Anonymous Login |
There are also many in the Amazon SDKs, in particular two which are automatically set up in the authentication chain:
classname | description |
---|---|
com.amazonaws.auth.InstanceProfileCredentialsProvider | EC2 Metadata Credentials |
com.amazonaws.auth.EnvironmentVariableCredentialsProvider | AWS Environment Variables |
Applications running in EC2 may associate an IAM role with the VM and query the EC2 Instance Metadata Service for credentials to access S3. Within the AWS SDK, this functionality is provided by InstanceProfileCredentialsProvider, which internally enforces a singleton instance in order to prevent throttling problems.
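A configuration restricting authentication to the instance profile alone might look like the following; the same provider setting reappears as the base configuration in the per-bucket examples later in this document:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>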
Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.
To authenticate with these:
Example:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
The lifetime of session credentials is fixed when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS.
Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
Once this is done, there’s no need to supply any credentials in the Hadoop configuration or via environment variables.
This option can be used to verify that an object store does not permit unauthenticated access: that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail —unless explicitly opened up for broader access.
hadoop fs -ls \
 -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
 s3a://landsat-pds/
Allowing anonymous access to an S3 bucket compromises security and therefore is unsuitable for most use cases.
If a list of credential providers is given in fs.s3a.aws.credentials.provider, then the Anonymous Credential provider must come last. If not, credential providers listed after it will be ignored.
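For instance, a chain which tries the simple provider first and falls back to anonymous access could be declared as below; order matters, with the anonymous provider last:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
  </value>
</property>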
Simple name/secret credentials with SimpleAWSCredentialsProvider
This is the standard credential provider, which supports the access key in fs.s3a.access.key and the secret key in fs.s3a.secret.key. It does not support authentication with login credentials declared in the URLs.
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
Apart from its lack of support for user:password details being included in filesystem URLs (a dangerous practise that is strongly discouraged), this provider acts exactly as the basic authenticator used in the default authentication chain.
This means that the default S3A authentication chain can be defined as
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
    com.amazonaws.auth.InstanceProfileCredentialsProvider
  </value>
</property>
To protect the access/secret keys from prying eyes, it is recommended that you use either IAM role-based authentication (such as EC2 instance profiles) or the credential provider framework, securely storing the keys and accessing them through configuration. The following describes using the latter for AWS credentials in the S3A FileSystem.
The Hadoop Credential Provider Framework allows secure “Credential Providers” to keep secrets outside Hadoop configuration files, storing them in encrypted files in local or Hadoop filesystems, and including them in requests.
The S3A configuration options with sensitive data (fs.s3a.secret.key, fs.s3a.access.key and fs.s3a.session.token) can have their data saved to a binary file, with the values being read in when the S3A filesystem URL is used for data access. The reference to this credential provider is all that is passed as a direct configuration option.
For additional reading on the Hadoop Credential Provider API see: Credential Provider API.
A credential file can be created on any Hadoop filesystem; when creating one on HDFS or a Unix filesystem the permissions are automatically set to keep the file private to the reader —though as directory permissions are not touched, users should verify that the directory containing the file is readable only by the current user.
hadoop credential create fs.s3a.access.key -value 123 \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

hadoop credential create fs.s3a.secret.key -value 456 \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
A credential file can be listed, to see what entries are kept inside it
hadoop credential list -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

Listing aliases for CredentialProvider: jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
fs.s3a.secret.key
fs.s3a.access.key
At this point, the credentials are ready for use.
The URL to the provider must be set in the configuration property hadoop.security.credential.provider.path, either on the command line or in XML configuration files.
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks</value>
  <description>Path to interrogate for protected credentials.</description>
</property>
Because this property only supplies the path to the secrets file, the configuration option itself is no longer a sensitive item.
The property hadoop.security.credential.provider.path is global to all filesystems and secrets. There is another property, fs.s3a.security.credential.provider.path which only lists credential providers for S3A filesystems. The two properties are combined into one, with the list of providers in the fs.s3a. property taking precedence over that of the hadoop.security list (i.e. they are prepended to the common list).
<property>
  <name>fs.s3a.security.credential.provider.path</name>
  <value />
  <description>
    Optional comma separated list of credential providers, a list
    which is prepended to that set in hadoop.security.credential.provider.path
  </description>
</property>
Supporting a separate list in an fs.s3a. prefix permits per-bucket configuration of credential files.
Once the provider is set in the Hadoop configuration, hadoop commands work exactly as if the secrets were in an XML file.
hadoop distcp \
    hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/

hadoop fs -ls s3a://glacier1/
The path to the provider can also be set on the command line:
hadoop distcp \
    -D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
    hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/

hadoop fs \
    -D fs.s3a.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
    -ls s3a://glacier1/
Because the provider path is not itself a sensitive secret, there is no risk from placing its declaration on the command line.
All S3A client options are configured with the prefix fs.s3a..
The client supports per-bucket configuration to allow different buckets to override the shared settings. This is commonly used to change the endpoint, encryption and authentication mechanisms of buckets, S3Guard options, and various minor options.
Here are the S3A properties for use in production. The S3Guard options are documented in the S3Guard documentation; some testing-related options are covered in Testing.
<property> <name>fs.s3a.connection.maximum</name> <value>15</value> <description>Controls the maximum number of simultaneous connections to S3.</description> </property> <property> <name>fs.s3a.connection.ssl.enabled</name> <value>true</value> <description>Enables or disables SSL connections to S3.</description> </property> <property> <name>fs.s3a.endpoint</name> <description>AWS S3 endpoint to connect to. An up-to-date list is provided in the AWS Documentation: regions and endpoints. Without this property, the standard region (s3.amazonaws.com) is assumed. </description> </property> <property> <name>fs.s3a.path.style.access</name> <value>false</value> <description>Enable S3 path style access ie disabling the default virtual hosting behaviour. Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting. </description> </property> <property> <name>fs.s3a.proxy.host</name> <description>Hostname of the (optional) proxy server for S3 connections.</description> </property> <property> <name>fs.s3a.proxy.port</name> <description>Proxy server port. If this property is not set but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with the value of fs.s3a.connection.ssl.enabled).</description> </property> <property> <name>fs.s3a.proxy.username</name> <description>Username for authenticating with proxy server.</description> </property> <property> <name>fs.s3a.proxy.password</name> <description>Password for authenticating with proxy server.</description> </property> <property> <name>fs.s3a.proxy.domain</name> <description>Domain for authenticating with proxy server.</description> </property> <property> <name>fs.s3a.proxy.workstation</name> <description>Workstation for authenticating with proxy server.</description> </property> <property> <name>fs.s3a.attempts.maximum</name> <value>20</value> <description>How many times we should retry commands on transient errors.</description> </property> <property> <name>fs.s3a.connection.establish.timeout</name> <value>5000</value> <description>Socket connection setup timeout in milliseconds.</description> </property> <property> <name>fs.s3a.connection.timeout</name> <value>200000</value> <description>Socket connection timeout in milliseconds.</description> </property> <property> <name>fs.s3a.paging.maximum</name> <value>5000</value> <description>How many keys to request from S3 when doing directory listings at a time.</description> </property> <property> <name>fs.s3a.threads.max</name> <value>10</value> <description> Maximum number of concurrent active (part)uploads, which each use a thread from the threadpool.</description> </property> <property> <name>fs.s3a.socket.send.buffer</name> <value>8192</value> <description>Socket send buffer hint to amazon connector. Represented in bytes.</description> </property> <property> <name>fs.s3a.socket.recv.buffer</name> <value>8192</value> <description>Socket receive buffer hint to amazon connector. Represented in bytes.</description> </property> <property> <name>fs.s3a.threads.keepalivetime</name> <value>60</value> <description>Number of seconds a thread can be idle before being terminated.</description> </property> <property> <name>fs.s3a.max.total.tasks</name> <value>5</value> <description>Number of (part)uploads allowed to the queue before blocking additional uploads.</description> </property> <property> <name>fs.s3a.multipart.size</name> <value>100M</value> <description>How big (in bytes) to split upload or copy operations up into. 
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value. </description> </property> <property> <name>fs.s3a.multipart.threshold</name> <value>2147483647</value> <description>How big (in bytes) to split upload or copy operations up into. This also controls the partition size in renamed files, as rename() involves copying the source file(s). A suffix from the set {K,M,G,T,P} may be used to scale the numeric value. </description> </property> <property> <name>fs.s3a.multiobjectdelete.enable</name> <value>true</value> <description>When enabled, multiple single-object delete requests are replaced by a single 'delete multiple objects'-request, reducing the number of requests. Beware: legacy S3-compatible object stores might not support this request. </description> </property> <property> <name>fs.s3a.acl.default</name> <description>Set a canned ACL for newly created and copied objects. Value may be Private, PublicRead, PublicReadWrite, AuthenticatedRead, LogDeliveryWrite, BucketOwnerRead, or BucketOwnerFullControl.</description> </property> <property> <name>fs.s3a.multipart.purge</name> <value>false</value> <description>True if you want to purge existing multipart uploads that may not have been completed/aborted correctly</description> </property> <property> <name>fs.s3a.multipart.purge.age</name> <value>86400</value> <description>Minimum age in seconds of multipart uploads to purge</description> </property> <property> <name>fs.s3a.signing-algorithm</name> <description>Override the default signing algorithm so legacy implementations can still be used</description> </property> <property> <name>fs.s3a.server-side-encryption-algorithm</name> <description>Specify a server-side encryption algorithm for s3a: file system. Unset by default. It supports the following values: 'AES256' (for SSE-S3), 'SSE-KMS' and 'SSE-C' </description> </property> <property> <name>fs.s3a.server-side-encryption.key</name> <description>Specific encryption key to use if fs.s3a.server-side-encryption-algorithm has been set to 'SSE-KMS' or 'SSE-C'. In the case of SSE-C, the value of this property should be the Base64 encoded key. If you are using SSE-KMS and leave this property empty, you'll be using your default's S3 KMS key, otherwise you should set this property to the specific KMS key id.</description> </property> <property> <name>fs.s3a.buffer.dir</name> <value>${hadoop.tmp.dir}/s3a</value> <description>Comma separated list of directories that will be used to buffer file uploads to.</description> </property> <property> <name>fs.s3a.block.size</name> <value>32M</value> <description>Block size to use when reading files using s3a: file system. </description> </property> <property> <name>fs.s3a.user.agent.prefix</name> <value></value> <description> Sets a custom value that will be prepended to the User-Agent header sent in HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent header always includes the Hadoop version number followed by a string generated by the AWS SDK. An example is "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6". If this optional property is set, then its value is prepended to create a customized User-Agent. For example, if this configuration property was set to "MyApp", then an example of the resulting User-Agent would be "User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6". 
</description> </property> <property> <name>fs.s3a.impl</name> <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value> <description>The implementation class of the S3A Filesystem</description> </property> <property> <name>fs.AbstractFileSystem.s3a.impl</name> <value>org.apache.hadoop.fs.s3a.S3A</value> <description>The implementation class of the S3A AbstractFileSystem.</description> </property> <property> <name>fs.s3a.readahead.range</name> <value>64K</value> <description>Bytes to read ahead during a seek() before closing and re-opening the S3 HTTP connection. This option will be overridden if any call to setReadahead() is made to an open stream.</description> </property> <property> <name>fs.s3a.list.version</name> <value>2</value> <description>Select which version of the S3 SDK's List Objects API to use. Currently support 2 (default) and 1 (older API).</description> </property>
Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.
As an example, a configuration could have a base configuration to use the IAM role information available when deployed in Amazon EC2.
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
This will become the default authentication mechanism for S3A buckets.
A bucket s3a://nightly/ used for nightly data can then be given a session key:
<property>
  <name>fs.s3a.bucket.nightly.access.key</name>
  <value>AKAACCESSKEY-2</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.secret.key</name>
  <value>SESSIONSECRETKEY</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.session.token</name>
  <value>Short-lived-session-token</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
Finally, the public s3a://landsat-pds/ bucket can be accessed anonymously:
<property>
  <name>fs.s3a.bucket.landsat-pds.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
Although most properties are automatically propagated from their fs.s3a.bucket.-prefixed custom entry to that of the base fs.s3a. option, supporting secrets kept in Hadoop credential files is slightly more complex. This is because the property values are kept in these files, and cannot be dynamically patched.
Instead, callers need to create different configuration files for each bucket, setting the base secrets (fs.s3a.access.key, etc), then declare the path to the appropriate credential file in a bucket-specific version of the property fs.s3a.security.credential.provider.path.
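As a sketch, assuming a hypothetical bucket nightly whose secrets live in their own credential file, the bucket-specific declaration could be:

<property>
  <name>fs.s3a.bucket.nightly.security.credential.provider.path</name>
  <!-- hypothetical credential file holding fs.s3a.access.key and fs.s3a.secret.key for this bucket -->
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/s3-nightly.jceks</value>
</property>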
S3 buckets are hosted in different “regions”, the default being “US-East”. The S3A client talks to this region by default, issuing HTTP requests to the server s3.amazonaws.com.
S3A can work with buckets from any region. Each region has its own S3 endpoint, documented by Amazon.
While it is generally simpler to use the default endpoint, working with V4-signing-only regions (Frankfurt, Seoul) requires the endpoint to be identified. Expect better performance from direct connections —traceroute will give you some insight.
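For example, to bind the client to the Frankfurt region directly (the endpoint value is taken from the region list below):

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>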
If the wrong endpoint is used, the request may fail. This may be reported as a 301/redirect error, or as a 400 Bad Request: take these as cues to check the endpoint setting of a bucket.
Here is a list of properties defining all AWS S3 regions, current as of June 2017:
<!-- This is the default endpoint, which can be used to interact with any v2 region. --> <property> <name>central.endpoint</name> <value>s3.amazonaws.com</value> </property> <property> <name>canada.endpoint</name> <value>s3.ca-central-1.amazonaws.com</value> </property> <property> <name>frankfurt.endpoint</name> <value>s3.eu-central-1.amazonaws.com</value> </property> <property> <name>ireland.endpoint</name> <value>s3-eu-west-1.amazonaws.com</value> </property> <property> <name>london.endpoint</name> <value>s3.eu-west-2.amazonaws.com</value> </property> <property> <name>mumbai.endpoint</name> <value>s3.ap-south-1.amazonaws.com</value> </property> <property> <name>ohio.endpoint</name> <value>s3.us-east-2.amazonaws.com</value> </property> <property> <name>oregon.endpoint</name> <value>s3-us-west-2.amazonaws.com</value> </property> <property> <name>sao-paolo.endpoint</name> <value>s3-sa-east-1.amazonaws.com</value> </property> <property> <name>seoul.endpoint</name> <value>s3.ap-northeast-2.amazonaws.com</value> </property> <property> <name>singapore.endpoint</name> <value>s3-ap-southeast-1.amazonaws.com</value> </property> <property> <name>sydney.endpoint</name> <value>s3-ap-southeast-2.amazonaws.com</value> </property> <property> <name>tokyo.endpoint</name> <value>s3-ap-northeast-1.amazonaws.com</value> </property> <property> <name>virginia.endpoint</name> <value>${central.endpoint}</value> </property>
This list can be used to specify the endpoint of individual buckets, for example for buckets in the central and EU/Ireland endpoints.
<property>
  <name>fs.s3a.bucket.landsat-pds.endpoint</name>
  <value>${central.endpoint}</value>
  <description>The endpoint for s3a://landsat-pds URLs</description>
</property>

<property>
  <name>fs.s3a.bucket.eu-dataset.endpoint</name>
  <value>${ireland.endpoint}</value>
  <description>The endpoint for s3a://eu-dataset URLs</description>
</property>
Why explicitly declare a bucket bound to the central endpoint? It ensures that if the default endpoint is changed to a new region, data stored in US-east is still reachable.
The original S3A client implemented file writes by buffering all data to disk as it was written to the OutputStream. Only when the stream’s close() method was called would the upload start.
This made output slow, especially on large uploads, and could even fill up the disk space of small (virtual) disks.
Hadoop 2.7 added the S3AFastOutputStream alternative, which Hadoop 2.8 expanded. It is now considered stable and has replaced the original S3AOutputStream, which is no longer shipped in Hadoop.
The “fast” output stream
Because it starts uploading while data is still being written, it offers significant benefits when very large amounts of data are generated. The in-memory buffering mechanisms may also offer speedup when running adjacent to S3 endpoints, as disks are not used for intermediate data storage.
<property> <name>fs.s3a.fast.upload.buffer</name> <value>disk</value> <description> The buffering mechanism to use. Values: disk, array, bytebuffer. "disk" will use the directories listed in fs.s3a.buffer.dir as the location(s) to save data prior to being uploaded. "array" uses arrays in the JVM heap "bytebuffer" uses off-heap memory within the JVM. Both "array" and "bytebuffer" will consume memory in a single stream up to the number of blocks set by: fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks. If using either of these mechanisms, keep this value low The total number of threads performing work across all threads is set by fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the number of queued work items. </description> </property> <property> <name>fs.s3a.multipart.size</name> <value>100M</value> <description>How big (in bytes) to split upload or copy operations up into. A suffix from the set {K,M,G,T,P} may be used to scale the numeric value. </description> </property> <property> <name>fs.s3a.fast.upload.active.blocks</name> <value>8</value> <description> Maximum Number of blocks a single output stream can have active (uploading, or queued to the central FileSystem instance's pool of queued operations. This stops a single stream overloading the shared thread pool. </description> </property>
Notes
If the amount of data written to a stream is below that set in fs.s3a.multipart.size, the upload is performed in the OutputStream.close() operation —as with the original output stream.
The published Hadoop metrics include live queue length and upload operation counts, making it possible to identify when there is a backlog of work or a mismatch between data generation rates and network bandwidth. Per-stream statistics can also be logged by calling toString() on the current stream.
Files being written are still invisible until the write completes in the close() call, which will block until the upload is completed.
When fs.s3a.fast.upload.buffer is set to disk, all data is buffered to local hard disks prior to upload. This minimizes the amount of memory consumed, and so eliminates heap size as the limiting factor in queued uploads —exactly as the original “direct to disk” buffering.
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
  <description>Comma separated list of directories that will be used to buffer file uploads to.</description>
</property>
This is the default buffer mechanism. The amount of data which can be buffered is limited by the amount of available disk space.
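Because fs.s3a.buffer.dir accepts a comma-separated list, buffering can be spread across several local disks; the directories below are hypothetical:

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/data1/tmp/s3a,/data2/tmp/s3a</value>
</property>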
When fs.s3a.fast.upload.buffer is set to bytebuffer, all data is buffered in “Direct” ByteBuffers prior to upload. This may be faster than buffering to disk, and, if disk space is small (for example, tiny EC2 VMs), there may not be much disk space to buffer with.
The ByteBuffers are created in the memory of the JVM, but not in the Java Heap itself. The amount of data which can be buffered is limited by the Java runtime, the operating system, and, for YARN applications, the amount of memory requested for each container.
The slower the upload bandwidth to S3, the greater the risk of running out of memory —and so the more care is needed in tuning the upload settings.
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>bytebuffer</value>
</property>
When fs.s3a.fast.upload.buffer is set to array, all data is buffered in byte arrays in the JVM’s heap prior to upload. This may be faster than buffering to disk.
The amount of data which can be buffered is limited by the available size of the JVM heap. The slower the write bandwidth to S3, the greater the risk of heap overflows. This risk can be mitigated by tuning the upload settings.
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>array</value>
</property>
Both the array and byte buffer mechanisms can consume very large amounts of memory, on-heap or off-heap respectively. The disk buffer mechanism does not use much memory, but will consume hard disk capacity.
If there are many output streams being written to in a single process, the amount of memory or disk used is the multiple of all streams’ active memory/disk use.
Careful tuning may be needed to reduce the risk of running out of memory, especially if the data is buffered in memory.
There are a number of parameters which can be tuned:
The total number of threads available in the filesystem for data uploads or any other queued filesystem operation. This is set in fs.s3a.threads.max
The number of operations which can be queued for execution, awaiting a thread: fs.s3a.max.total.tasks
The number of blocks which a single output stream can have active, that is: being uploaded by a thread, or queued in the filesystem thread queue: fs.s3a.fast.upload.active.blocks
How long an idle thread can stay in the thread pool before it is retired: fs.s3a.threads.keepalivetime
When the maximum allowed number of active blocks of a single stream is reached, no more blocks can be uploaded from that stream until one or more of those active blocks’ uploads completes. That is: a write() call which would trigger an upload of a now full datablock, will instead block until there is capacity in the queue.
How does that come together?
As the pool of threads set in fs.s3a.threads.max is shared (and intended to be used across all threads), a larger number here can allow for more parallel operations. However, as uploads require network bandwidth, adding more threads does not guarantee speedup.
The extra queue of tasks for the thread pool (fs.s3a.max.total.tasks) covers all ongoing background S3A operations (future plans include: parallelized rename operations, asynchronous directory operations).
When using memory buffering, a small value of fs.s3a.fast.upload.active.blocks limits the amount of memory which can be consumed per stream.
When using disk buffering a larger value of fs.s3a.fast.upload.active.blocks does not consume much memory. But it may result in a large number of blocks to compete with other filesystem operations.
We recommend a low value of fs.s3a.fast.upload.active.blocks; enough to start background upload without overloading other parts of the system, then experiment to see if higher values deliver more throughput —especially from VMs running on EC2.
<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>4</value>
  <description>
    Maximum Number of blocks a single output stream can have
    active (uploading, or queued to the central FileSystem
    instance's pool of queued operations.
    This stops a single stream overloading the shared thread pool.
  </description>
</property>

<property>
  <name>fs.s3a.threads.max</name>
  <value>10</value>
  <description>The total number of threads available in the filesystem
    for data uploads *or any other queued filesystem operation*.</description>
</property>

<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>5</value>
  <description>The number of operations which can be queued for execution</description>
</property>

<property>
  <name>fs.s3a.threads.keepalivetime</name>
  <value>60</value>
  <description>Number of seconds a thread can be idle before being terminated.</description>
</property>
If a large stream write operation is interrupted, there may be intermediate partitions uploaded to S3 —data which will be billed for.
These charges can be reduced by enabling fs.s3a.multipart.purge, and setting a purge time in seconds, such as 86400 seconds —24 hours. When an S3A FileSystem instance is instantiated with the purge time greater than zero, it will, on startup, delete all outstanding partition requests older than this time.
<property>
  <name>fs.s3a.multipart.purge</name>
  <value>true</value>
  <description>True if you want to purge existing multipart uploads that may not have been
    completed/aborted correctly</description>
</property>

<property>
  <name>fs.s3a.multipart.purge.age</name>
  <value>86400</value>
  <description>Minimum age in seconds of multipart uploads to purge</description>
</property>
If an S3A client is instantiated with fs.s3a.multipart.purge=true, it will delete all out of date uploads in the entire bucket. That is: it will affect all multipart uploads to that bucket, from all applications.
Leaving fs.s3a.multipart.purge to its default, false, means that the client will not make any attempt to reset or change the partition rate.
The best practise for using this option is to disable multipart purges in normal use of S3A, enabling only in manual/scheduled housekeeping operations.
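Since the purge runs when an S3A FileSystem instance is created with purging enabled, one possible housekeeping step is a one-off command with the options passed on the command line; the bucket name here is a placeholder:

hadoop fs \
    -D fs.s3a.multipart.purge=true \
    -D fs.s3a.multipart.purge.age=86400 \
    -ls s3a://example-bucket/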
The S3A Filesystem client supports the notion of input policies, similar to that of the Posix fadvise() API call. This tunes the behavior of the S3A client to optimise HTTP GET requests for the different use cases.
“sequential”
Read through the file, possibly with some short forward seeks.
The whole document is requested in a single HTTP request; forward seeks within the readahead range are supported by skipping over the intermediate data.
This leads to maximum read throughput —but with very expensive backward seeks.
“normal” (default)
The “Normal” policy starts off reading a file in “sequential” mode, but if the caller seeks backwards in the stream, it switches from sequential to “random”.
This policy effectively recognizes the initial read pattern of columnar storage formats (e.g. Apache ORC and Apache Parquet), which seek to the end of a file, read in index data and then seek backwards to selectively read columns. The first seeks may be expensive compared to the random policy, however the overall process is much less expensive than either sequentially reading through a file with the “random” policy, or reading columnar data with the “sequential” policy. When the exact format/recommended seek policy of data is known in advance, setting the matching policy explicitly may still give better results.
“random”
Optimised for random IO, specifically the Hadoop PositionedReadable operations —though seek(offset); read(byte_buffer) also benefits.
Rather than ask for the whole file, the range of the HTTP request is set to that of the length of data desired in the read operation (rounded up to the readahead value set in setReadahead() if necessary).
By reducing the cost of closing existing HTTP requests, this is highly efficient for file IO accessing a binary file through a series of PositionedReadable.read() and PositionedReadable.readFully() calls. Sequential reading of a file is expensive, as now many HTTP requests must be made to read through the file.
For operations simply reading through a file: copying, distCp, reading Gzipped or other compressed formats, parsing .csv files, etc, the sequential policy is appropriate. As the default normal policy begins in sequential mode, S3A does not need to be specially configured for these workloads.
For the specific case of high-performance random access IO, the random policy may be considered. The requirements are:
The desired fadvise policy must be set in the configuration option fs.s3a.experimental.input.fadvise when the filesystem instance is created. That is: it can only be set on a per-filesystem basis, not on a per-file-read basis.
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
  <description>Policy for reading files.
    Values: 'random', 'sequential' or 'normal'
  </description>
</property>
HDFS-2744, Extend FSDataInputStream to allow fadvise proposes adding a public API to set fadvise policies on input streams. Once implemented, this will become the supported mechanism used for configuring the input IO policy.
Hadoop’s distcp application can be used to copy data between a Hadoop cluster and Amazon S3. See Copying Data Between a Cluster and Amazon S3 for details on S3 copying specifically.
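A typical upload from a cluster to S3 might look like the following; the NameNode address and bucket name are placeholders:

hadoop distcp \
    hdfs://nn1.example.com:8020/datasets/set1 \
    s3a://example-bucket/datasets/set1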