The hadoop-aliyun module provides support for integrating Hadoop with Aliyun Object Storage Service (Aliyun OSS). The generated JAR file, hadoop-aliyun.jar, also declares a transitive dependency on all external artifacts needed for this support, so downstream applications can use it easily.
To make it part of Apache Hadoop’s default classpath, simply make sure that HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has ‘hadoop-aliyun’ in the list.
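For a downstream Maven build, pulling in the module together with its transitive dependencies might look like the sketch below. The groupId and artifactId are those of the Hadoop release artifacts; the hadoop.version property is an assumption and has to be defined elsewhere in your POM to match the Hadoop release you run.

```xml
<!-- pom.xml fragment: place inside the <dependencies> element -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aliyun</artifactId>
  <!-- hadoop.version is a placeholder property, not defined by this document -->
  <version>${hadoop.version}</version>
</dependency>
```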
Aliyun OSS is an example of “an object store”. In order to achieve scalability and especially high availability, Aliyun OSS has relaxed some of the constraints which classic “POSIX” filesystems promise.
Specifically
Features of Hadoop relying on this can have unexpected behaviour. E.g. the AggregatedLogDeletionService of YARN will not remove the appropriate log files.
Your Aliyun credentials not only pay for services, they offer read and write access to the data. Anyone with the credentials can not only read your datasets; they can also delete them.
Do not inadvertently share these credentials through means such as:

1. Checking in to SCM any configuration files containing the secrets.
2. Logging them to a console, as they invariably end up being seen.
3. Defining filesystem URIs with the credentials in the URL, such as oss://accessKeyId:accessKeySecret@directory/file. They will end up in logs and error messages.
4. Including the secrets in bug reports.
If you do any of these: change your credentials immediately!
Specifically: on Aliyun E-MapReduce, oss:// is also supported but with a different implementation. If you are using Aliyun E-MapReduce, follow these instructions, and be aware that all issues related to Aliyun OSS integration in E-MapReduce can only be addressed by Aliyun themselves: please raise your issues with them.
<property>
  <name>fs.oss.accessKeyId</name>
  <description>Aliyun access key ID</description>
</property>

<property>
  <name>fs.oss.accessKeySecret</name>
  <description>Aliyun access key secret</description>
</property>

<property>
  <name>fs.oss.credentials.provider</name>
  <description>
    Class name of a credentials provider that implements
    com.aliyun.oss.common.auth.CredentialsProvider. Omit if using access/secret keys
    or another authentication mechanism. The specified class must provide an
    accessible constructor accepting java.net.URI and
    org.apache.hadoop.conf.Configuration, or an accessible default constructor.
  </description>
</property>
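As an illustration, a configuration that authenticates with an access key pair could be sketched as follows. Both values are placeholders rather than real credentials, and the file that holds them should be kept out of source control.

```xml
<configuration>
  <property>
    <name>fs.oss.accessKeyId</name>
    <!-- placeholder value: substitute your own access key ID -->
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.oss.accessKeySecret</name>
    <!-- placeholder value: substitute your own access key secret -->
    <value>YOUR_ACCESS_KEY_SECRET</value>
  </property>
</configuration>
```

To use a custom provider instead, these two properties would be replaced by a single fs.oss.credentials.provider entry whose value is the fully qualified name of your provider class.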
<property>
  <name>fs.oss.endpoint</name>
  <description>Aliyun OSS endpoint to connect to. An up-to-date list is
    provided in the Aliyun OSS Documentation.
  </description>
</property>

<property>
  <name>fs.oss.proxy.host</name>
  <description>Hostname of the (optional) proxy server for Aliyun OSS connection</description>
</property>

<property>
  <name>fs.oss.proxy.port</name>
  <description>Proxy server port</description>
</property>

<property>
  <name>fs.oss.proxy.username</name>
  <description>Username for authenticating with proxy server</description>
</property>

<property>
  <name>fs.oss.proxy.password</name>
  <description>Password for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.oss.proxy.domain</name>
  <description>Domain for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.oss.proxy.workstation</name>
  <description>Workstation for authenticating with proxy server.</description>
</property>

<property>
  <name>fs.oss.attempts.maximum</name>
  <value>20</value>
  <description>How many times we should retry commands on transient errors.</description>
</property>

<property>
  <name>fs.oss.connection.establish.timeout</name>
  <value>50000</value>
  <description>Connection setup timeout in milliseconds.</description>
</property>

<property>
  <name>fs.oss.connection.timeout</name>
  <value>200000</value>
  <description>Socket connection timeout in milliseconds.</description>
</property>

<property>
  <name>fs.oss.paging.maximum</name>
  <value>1000</value>
  <description>How many keys to request from Aliyun OSS at a time when doing directory listings.
  </description>
</property>

<property>
  <name>fs.oss.multipart.upload.size</name>
  <value>10485760</value>
  <description>Size of each multipart piece in bytes.</description>
</property>

<property>
  <name>fs.oss.multipart.upload.threshold</name>
  <value>20971520</value>
  <description>Minimum size in bytes before we start a multipart upload or copy.</description>
</property>

<property>
  <name>fs.oss.multipart.download.size</name>
  <value>102400</value>
  <description>Size in bytes of each request from Aliyun OSS.</description>
</property>

<property>
  <name>fs.oss.buffer.dir</name>
  <description>Comma-separated list of directories used to buffer OSS data before uploading to Aliyun OSS</description>
</property>

<property>
  <name>fs.oss.acl.default</name>
  <value></value>
  <description>Set a canned ACL for the bucket. Value may be private, public-read, public-read-write.
  </description>
</property>

<property>
  <name>fs.oss.server-side-encryption-algorithm</name>
  <value></value>
  <description>Specify a server-side encryption algorithm for the oss: file system.
    Unset by default, and the only other currently allowable value is AES256.
  </description>
</property>

<property>
  <name>fs.oss.connection.maximum</name>
  <value>32</value>
  <description>Number of simultaneous connections to OSS.</description>
</property>

<property>
  <name>fs.oss.connection.secure.enabled</name>
  <value>true</value>
  <description>Whether to connect to OSS over SSL; true by default.</description>
</property>
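Most of these properties can be left at their defaults. A minimal sketch of overriding a few of them is shown below; it assumes a bucket in the Hangzhou region and two local scratch directories, both of which are illustrative values rather than recommendations.

```xml
<configuration>
  <property>
    <name>fs.oss.endpoint</name>
    <!-- assumed region endpoint: use the one that matches your bucket -->
    <value>oss-cn-hangzhou.aliyuncs.com</value>
  </property>
  <property>
    <name>fs.oss.buffer.dir</name>
    <!-- hypothetical local directories, comma separated -->
    <value>/tmp/oss,/data/tmp/oss</value>
  </property>
  <property>
    <name>fs.oss.server-side-encryption-algorithm</name>
    <!-- AES256 is the only allowable value other than unset -->
    <value>AES256</value>
  </property>
  <property>
    <name>fs.oss.acl.default</name>
    <!-- one of: private, public-read, public-read-write -->
    <value>private</value>
  </property>
</configuration>
```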
To test the oss:// filesystem client, two files that pass authentication details to the test runner are needed.
Those two configuration files must be put into hadoop-tools/hadoop-aliyun/src/test/resources.
The first file, core-site.xml, already exists and sources the configuration created in auth-keys.xml.
For most cases, no modification is needed, unless a specific, non-default property needs to be set during the testing.
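For orientation, a file that sources auth-keys.xml through XInclude could be sketched as below. This is an assumption about the general shape, not a copy of the file shipped with the module; the empty fallback element is included on the assumption that a missing auth-keys.xml should be tolerated when the file is loaded.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- pull in the credentials and bucket settings kept in auth-keys.xml -->
  <include xmlns="http://www.w3.org/2001/XInclude" href="auth-keys.xml">
    <!-- assumed: an empty fallback so a missing auth-keys.xml does not fail the load -->
    <fallback/>
  </include>
</configuration>
```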
The second file, auth-keys.xml, triggers the testing of the Aliyun OSS module. Without this file, none of the tests in this module will be executed.
It contains the access key ID/secret and proxy information needed to connect to Aliyun OSS; an OSS bucket URL should also be provided.
The contents of the bucket will be cleaned during the testing process, so do not use the bucket for any purpose other than testing.
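A hedged sketch of what auth-keys.xml might contain follows. The property name test.fs.oss.name for the bucket URL is an assumption about what the test suite reads and should be checked against the test sources; the bucket name and credential values are placeholders.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- assumed property name for the test bucket URL; verify against the test sources -->
    <name>test.fs.oss.name</name>
    <!-- placeholder bucket: its contents are wiped during the tests -->
    <value>oss://YOUR_TEST_BUCKET/</value>
  </property>
  <property>
    <name>fs.oss.endpoint</name>
    <!-- illustrative endpoint: use the one for your bucket's region -->
    <value>oss-cn-hangzhou.aliyuncs.com</value>
  </property>
  <property>
    <name>fs.oss.accessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.oss.accessKeySecret</name>
    <value>YOUR_ACCESS_KEY_SECRET</value>
  </property>
</configuration>
```

Because this file holds secrets, it should never be checked in to SCM.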
Create the file contract-test-options.xml under /test/resources. If the fs.contract.test.fs.oss test path is not defined, the contract tests will be skipped. Credentials are also needed to run any of those tests; they can be copied from auth-keys.xml or pulled in through direct XInclude inclusion. Here is an example of contract-test-options.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude" href="auth-keys.xml"/>

  <property>
    <name>fs.contract.test.fs.oss</name>
    <value>oss://spark-tests</value>
  </property>

  <property>
    <name>fs.oss.impl</name>
    <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
  </property>

  <property>
    <name>fs.oss.endpoint</name>
    <value>oss-cn-hangzhou.aliyuncs.com</value>
  </property>

  <property>
    <name>fs.oss.buffer.dir</name>
    <value>/tmp/oss</value>
  </property>

  <property>
    <name>fs.oss.multipart.download.size</name>
    <value>102400</value>
  </property>
</configuration>