The hadoop-azure module provides support for the Azure Data Lake Storage Gen2 storage layer through the “abfs” connector.
To make it part of Apache Hadoop’s default classpath, make sure that the HADOOP_OPTIONAL_TOOLS environment variable includes hadoop-azure in its list, on every machine in the cluster.
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
You can set this locally in your .profile/.bashrc, but note it won’t propagate to jobs running in-cluster.
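As a quick sanity check on a node, the following minimal sketch tests whether hadoop-azure is listed in the variable. It assumes the variable holds a comma-separated list; the helper name `has_optional_tool` is hypothetical and not part of Hadoop.

```python
import os


def has_optional_tool(tool, env=None):
    """Return True if `tool` appears in HADOOP_OPTIONAL_TOOLS.

    Assumption: the variable is a comma-separated list of module names.
    """
    env = os.environ if env is None else env
    value = env.get("HADOOP_OPTIONAL_TOOLS", "")
    entries = [entry.strip() for entry in value.split(",") if entry.strip()]
    return tool in entries


if __name__ == "__main__":
    # Reports on the environment of the current shell only.
    print(has_optional_tool("hadoop-azure"))
```

Running this on each machine only reports on that shell's environment; it says nothing about the environment of jobs launched inside the cluster.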
For details on ABFS, consult the following documents:
The Azure Storage data model presents 3 core concepts:
The ABFS connector connects to classic containers, or those created with Hierarchical Namespaces.
A key aspect of ADLS Gen 2 is its support for hierarchical namespaces. These are effectively directories and offer high-performance rename and delete operations, which significantly improves performance for query engines writing data to the store, including MapReduce, Spark and Hive, as well as DistCp.
This feature is only available if the container was created with “namespace” support.
You enable namespace support when creating a new Storage Account, by checking the “Hierarchical Namespace” option in the Portal UI, or, when creating through the command line, using the option --hierarchical-namespace true.
You cannot enable Hierarchical Namespaces on an existing storage account.
Containers in a storage account with Hierarchical Namespaces are not (currently) readable through the wasb: connector.
Some of the az storage command line commands fail too, for example:
$ az storage container list --account-name abfswales1
Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
The best documentation on getting started with Azure Datalake Gen2 with the abfs connector is Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters
It includes instructions to create it from the Azure command line tool, which can be installed on Windows, MacOS (via Homebrew) and Linux (apt or yum).
The az storage subcommand handles all storage commands; az storage account create does the creation.
Until the ADLS Gen2 API support is finalized, you need to add an extension to the az command.
az extension add --name storage-preview
Check that all is well by verifying that the usage command includes --hierarchical-namespace:
$ az storage account
usage: az storage account create [-h] [--verbose] [--debug]
     [--output {json,jsonc,table,tsv,yaml,none}]
     [--query JMESPATH] --resource-group RESOURCE_GROUP_NAME
     --name ACCOUNT_NAME
     [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
     [--location LOCATION]
     [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
     [--tags [TAGS [TAGS ...]]]
     [--custom-domain CUSTOM_DOMAIN]
     [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
     [--access-tier {Hot,Cool}]
     [--https-only [{true,false}]]
     [--file-aad [{true,false}]]
     [--hierarchical-namespace [{true,false}]]
     [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
     [--default-action {Allow,Deny}]
     [--assign-identity]
     [--subscription _SUBSCRIPTION]
You can list locations from az account list-locations, which lists the name to refer to in the --location argument:
$ az account list-locations -o table
DisplayName          Latitude    Longitude    Name
-------------------  ----------  -----------  ------------------
East Asia            22.267      114.188      eastasia
Southeast Asia       1.283       103.833      southeastasia
Central US           41.5908     -93.6208     centralus
East US              37.3719     -79.8164     eastus
East US 2            36.6681     -78.3889     eastus2
West US              37.783      -122.417     westus
North Central US     41.8819     -87.6278     northcentralus
South Central US     29.4167     -98.5        southcentralus
North Europe         53.3478     -6.2597      northeurope
West Europe          52.3667     4.9          westeurope
Japan West           34.6939     135.5022     japanwest
Japan East           35.68       139.77       japaneast
Brazil South         -23.55      -46.633      brazilsouth
Australia East       -33.86      151.2094     australiaeast
Australia Southeast  -37.8136    144.9631     australiasoutheast
South India          12.9822     80.1636      southindia
Central India        18.5822     73.9197      centralindia
West India           19.088      72.868       westindia
Canada Central       43.653      -79.383      canadacentral
Canada East          46.817      -71.217      canadaeast
UK South             50.941      -0.799       uksouth
UK West              53.427      -3.084       ukwest
West Central US      40.890      -110.234     westcentralus
West US 2            47.233      -119.852     westus2
Korea Central        37.5665     126.9780     koreacentral
Korea South          35.1796     129.0756     koreasouth
France Central       46.3772     2.3730       francecentral
France South         43.8345     2.1972       francesouth
Australia Central    -35.3075    149.1244     australiacentral
Australia Central 2  -35.3075    149.1244     australiacentral2
Once a location has been chosen, create the account:
az storage account create --verbose \
    --name abfswales1 \
    --resource-group devteam2 \
    --kind StorageV2 \
    --hierarchical-namespace true \
    --location ukwest \
    --sku Standard_LRS \
    --https-only true \
    --encryption-services blob \
    --access-tier Hot \
    --tags owner=engineering \
    --assign-identity \
    --output jsonc
The output of the command is a JSON document, whose primaryEndpoints field includes the name of the store endpoint:
{
  "primaryEndpoints": {
    "blob": "https://abfswales1.blob.core.windows.net/",
    "dfs": "https://abfswales1.dfs.core.windows.net/",
    "file": "https://abfswales1.file.core.windows.net/",
    "queue": "https://abfswales1.queue.core.windows.net/",
    "table": "https://abfswales1.table.core.windows.net/",
    "web": "https://abfswales1.z35.web.core.windows.net/"
  }
}
The abfswales1.dfs.core.windows.net account is the name by which the storage account will be referred to.
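If the JSON output is captured by a script, the account name can be pulled out of the dfs endpoint rather than copied by hand. The following is a minimal sketch; the function name `dfs_account_host` and the abbreviated sample document are illustrative assumptions.

```python
import json
from urllib.parse import urlparse


def dfs_account_host(create_output):
    """Extract the host name of the "dfs" endpoint from the JSON output."""
    endpoints = json.loads(create_output)["primaryEndpoints"]
    return urlparse(endpoints["dfs"]).hostname


# Abbreviated sample of the output shown above.
SAMPLE_OUTPUT = """
{
  "primaryEndpoints": {
    "blob": "https://abfswales1.blob.core.windows.net/",
    "dfs": "https://abfswales1.dfs.core.windows.net/"
  }
}
"""

if __name__ == "__main__":
    print(dfs_account_host(SAMPLE_OUTPUT))
```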
Now ask for the connection string to the store, which contains the account key:
az storage account show-connection-string --name abfswales1
{
  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
}
You then need to add the access key to your core-site.xml or JCEKS file, or use your cluster management tool to set the option fs.azure.account.key.STORAGE-ACCOUNT to this value.
<property>
  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
</property>
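The step from connection string to configuration entry can be scripted. The sketch below splits the connection string into its fields and renders the matching property element. The function names and the placeholder key are hypothetical, and the dfs.core.windows.net suffix is assumed to match the public Azure cloud endpoint shown above.

```python
import xml.etree.ElementTree as ET


def parse_connection_string(conn):
    """Split "k1=v1;k2=v2" into a dict.

    Only the first '=' of each part is a separator: base64 account keys
    usually end in '=' padding, which must stay in the value.
    """
    return dict(part.split("=", 1) for part in conn.split(";") if part)


def account_key_property(conn, endpoint_suffix="dfs.core.windows.net"):
    """Render the fs.azure.account.key.* property for core-site.xml."""
    fields = parse_connection_string(conn)
    prop = ET.Element("property")
    name = "fs.azure.account.key.%s.%s" % (fields["AccountName"], endpoint_suffix)
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = fields["AccountKey"]
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # PLACEHOLDER== stands in for a real key; never commit real keys.
    conn = ("DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;"
            "AccountName=abfswales1;AccountKey=PLACEHOLDER==")
    print(account_key_property(conn))
```

Writing the key into a JCEKS file instead of plain XML keeps it out of world-readable configuration files.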
Creation through the portal is covered in Quickstart: Create an Azure Data Lake Storage Gen2 storage account
Key Steps
You have now created your storage account. Next, get the account key to use with the default “Shared Key” authentication.
An Azure storage account can have multiple containers, each with the container name as the userinfo field of the URI used to reference it.
For example, the container “container1” in the storage account just created will have the URL abfs://container1@abfswales1.dfs.core.windows.net/
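To make the URI layout concrete, here is a minimal sketch that builds such a URL and splits it back into container, account host and path. The helper names are hypothetical, and the dfs.core.windows.net suffix is assumed.

```python
from urllib.parse import urlsplit


def abfs_uri(container, account, path="/"):
    """Build abfs://<container>@<account>.dfs.core.windows.net/<path>."""
    if not path.startswith("/"):
        path = "/" + path
    return "abfs://%s@%s.dfs.core.windows.net%s" % (container, account, path)


def split_abfs_uri(uri):
    """Return (container, account host, path); the container is the userinfo field."""
    parts = urlsplit(uri)
    return parts.username, parts.hostname, parts.path


if __name__ == "__main__":
    uri = abfs_uri("container1", "abfswales1")
    print(uri)
    print(split_abfs_uri(uri))
```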
You can create a new container through the ABFS connector, by setting the option fs.azure.createRemoteFileSystemDuringInitialization to true.
If the container does not exist, an attempt to list it with hadoop fs -ls will fail:
$ hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/
ls: `abfs://container1@abfswales1.dfs.core.windows.net/': No such file or directory
Enable remote FS creation and the second attempt succeeds, creating the container as it does so:
$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \
    -ls abfs://container1@abfswales1.dfs.core.windows.net/
This is useful for creating containers on the command line, especially before the az storage command supports hierarchical namespaces completely.
You can use the Azure Storage Explorer
Any configuration can be specified generally (or as the default when accessing all accounts) or can be tied to a specific account. For example, an OAuth identity can be configured for use regardless of which account is accessed with the property fs.azure.account.oauth2.client.id or you can configure an identity to be used only for a specific storage account with fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net.
This is shown in the Authentication section.
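The lookup order described above can be sketched as follows. This illustrates the documented behaviour (account-specific key first, then the general key) and is not the connector's actual code; the function name `resolve` and the sample values are hypothetical.

```python
def resolve(conf, key, account_host):
    """Look up `key` for one account.

    The account-specific form "<key>.<account_host>" wins when present;
    otherwise the general key acts as the default for all accounts.
    """
    specific = "%s.%s" % (key, account_host)
    if specific in conf:
        return conf[specific]
    return conf.get(key)


if __name__ == "__main__":
    conf = {
        "fs.azure.account.oauth2.client.id": "default-client",
        "fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net": "wales-client",
    }
    key = "fs.azure.account.oauth2.client.id"
    print(resolve(conf, key, "abfswales1.dfs.core.windows.net"))
    print(resolve(conf, key, "other.dfs.core.windows.net"))
```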
Authentication for ABFS is ultimately granted by Azure Active Directory.
The concepts involved are beyond the scope of this document; developers are expected to have read and understood them in order to take advantage of the different authentication mechanisms.
What is covered here, briefly, is how to configure the ABFS client to authenticate in different deployment situations.
The ABFS client can be deployed in different ways, with its authentication needs driven by them.
What can be changed is what secrets/credentials are used to authenticate the caller.
The authentication mechanism is set in fs.azure.account.auth.type (or the account-specific variant) and, for the various OAuth options, in fs.azure.account.oauth.provider.type.
All secrets can be stored in JCEKS files. These are encrypted and password protected; use them or a compatible Hadoop Key Management Store wherever possible.
This is the simplest authentication mechanism of account + password.
The account name is inferred from the URL; the password, “key”, is retrieved from the XML/JCEKS configuration files.
<property>
  <name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
  <value>SharedKey</value>
  <description>
  </description>
</property>
<property>
  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
  <description>
  The secret password. Never share these.
  </description>
</property>
Note: The source of the account key can be changed through a custom key provider; one exists to execute a shell script to retrieve it.
OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.
The specifics of this process are covered in hadoop-azure-datalake; the key names are slightly different here.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
  Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  <description>
  Use client credentials
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.endpoint</name>
  <value></value>
  <description>
  URL of OAuth endpoint
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
  Client ID
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.secret</name>
  <value></value>
  <description>
  Secret
  </description>
</property>
An OAuth 2.0 endpoint, username and password are provided in the configuration/JCEKS file.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
  Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.UserPasswordTokenProvider</value>
  <description>
  Use user and password
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.endpoint</name>
  <value></value>
  <description>
  URL of OAuth 2.0 endpoint
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.user.name</name>
  <value></value>
  <description>
  username
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.user.password</name>
  <value></value>
  <description>
  password for account
  </description>
</property>
With an existing OAuth 2.0 token, make a request to the Active Directory endpoint https://login.microsoftonline.com/Common/oauth2/token for this token to be refreshed.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
  Use OAuth 2.0 authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider</value>
  <description>
  Use the Refresh Token Provider
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.refresh.token</name>
  <value></value>
  <description>
  Refresh token
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
  Optional Client ID
  </description>
</property>
Azure Managed Identities, formerly “Managed Service Identities”.
OAuth 2.0 tokens are issued by a special endpoint only accessible from the executing VM (http://169.254.169.254/metadata/identity/oauth2/token). The issued credentials can be used to authenticate.
The Azure Portal/CLI is used to create the service identity.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
  <description>
  Use OAuth authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
  <description>
  Use MSI for issuing OAuth tokens
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.msi.tenant</name>
  <value></value>
  <description>
  Optional MSI Tenant ID
  </description>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value></value>
  <description>
  Optional Client ID
  </description>
</property>
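When debugging a Managed Identity setup, it can help to see what a token request to the VM-local endpoint looks like. The sketch below only builds the request and does not send it. The api-version value, the resource URI and the Metadata header are assumptions taken from the general Azure Instance Metadata Service conventions, not from this document, and the function name is hypothetical; check them against the Azure documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint quoted in the text above; reachable only from inside an Azure VM.
MSI_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"


def msi_token_request(client_id=None,
                      resource="https://storage.azure.com/",
                      api_version="2018-02-01"):
    """Build (but do not send) a GET request for an OAuth 2.0 token.

    Assumptions: `resource`, `api_version` and the "Metadata: true" header
    follow the Azure Instance Metadata Service conventions.
    """
    params = {"api-version": api_version, "resource": resource}
    if client_id:
        # Optional: selects one identity when the VM has several.
        params["client_id"] = client_id
    return Request(MSI_ENDPOINT + "?" + urlencode(params),
                   headers={"Metadata": "true"})


if __name__ == "__main__":
    request = msi_token_request()
    print(request.full_url)
```

Passing the request to `urllib.request.urlopen` from a machine outside Azure is expected to fail, since the address is link-local.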
A Custom OAuth 2.0 token provider supplies the ABFS connector with an OAuth 2.0 token when its getAccessToken() method is invoked.
<property>
  <name>fs.azure.account.auth.type</name>
  <value>Custom</value>
  <description>
  Custom Authentication
  </description>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value></value>
  <description>
  classname of Custom Authentication Provider
  </description>
</property>
The declared class must implement org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee and optionally org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension.
The declared class is also responsible for implementing retry logic while fetching access tokens.
The connector uses the JVM proxy settings to control its proxy setup.
See the Oracle Java documentation for the options to set.
As the connector uses HTTPS by default, the https.proxyHost and https.proxyPort options are those which must be configured.
In MapReduce jobs, including distcp, the proxy options must be set in both the mapreduce.map.java.opts and mapreduce.reduce.java.opts.
# this variable is only here to avoid typing the same values twice.
# Its name is not important.
export DISTCP_PROXY_OPTS="-Dhttps.proxyHost=web-proxy.example.com -Dhttps.proxyPort=80"
hadoop distcp \
    -D mapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
    -D mapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
    -update -skipcrccheck -numListstatusThreads 40 \
    hdfs://namenode:8020/users/alice abfs://backups@account.dfs.core.windows.net/users/alice
Without these settings, even though access to ADLS may work from the command line, distcp access can fail with network errors.
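When distcp is launched from a script rather than a shell, the same options can be assembled as an argument list, which avoids quoting problems with the space inside the proxy options. This is a minimal sketch; the function name is hypothetical and only options already shown above are used.

```python
import shlex


def distcp_with_proxy(src, dest, proxy_host, proxy_port):
    """Build a distcp argument list with the proxy set for map and reduce JVMs."""
    proxy_opts = "-Dhttps.proxyHost=%s -Dhttps.proxyPort=%d" % (proxy_host, proxy_port)
    return [
        "hadoop", "distcp",
        "-D", "mapreduce.map.java.opts=" + proxy_opts,
        "-D", "mapreduce.reduce.java.opts=" + proxy_opts,
        "-update", src, dest,
    ]


if __name__ == "__main__":
    argv = distcp_with_proxy(
        "hdfs://namenode:8020/users/alice",
        "abfs://backups@account.dfs.core.windows.net/users/alice",
        "web-proxy.example.com", 80)
    # Show the shell-quoted form; pass `argv` itself to subprocess.run to execute.
    print(shlex.join(argv))
```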
As with other object stores, login secrets are valuable pieces of information. Organizations should have a process for safely sharing them.
As with all Azure storage services, the Azure Datalake Gen 2 store offers a fully consistent view of the store, with complete Create, Read, Update, and Delete consistency for data and metadata. (Compare and contrast with S3 which only offers Create consistency; S3Guard adds CRUD to metadata, but not the underlying data).
For containers with hierarchical namespaces, the scalability numbers are, in Big-O-notation, as follows:
Operation | Scalability |
---|---|
File Rename | O(1) |
File Delete | O(1) |
Directory Rename | O(1) |
Directory Delete | O(1) |
For non-namespace stores, the scalability becomes:
Operation | Scalability |
---|---|
File Rename | O(1) |
File Delete | O(1) |
Directory Rename | O(files) |
Directory Delete | O(files) |
That is: the more files there are, the slower directory operations get.
Further reading: Azure Storage Scalability Targets
The ABFS connector supports a number of limited-private/unstable extension points for third-parties to integrate their authentication and authorization services into the ABFS client.
Consult the source in org.apache.hadoop.fs.azurebfs.extensions and all associated tests to see how to make use of these extension points.
Warning: these extension points are unstable.
Consult the javadocs for org.apache.hadoop.fs.azurebfs.constants.ConfigurationKeys, org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations and org.apache.hadoop.fs.azurebfs.AbfsConfiguration for the full list of configuration options and their default values.
The config fs.azure.enable.flush controls whether the ABFS flush APIs, HFlush() and HSync(), do any work; setting it to false makes them no-ops. By default, this config is set to true.
With the default setting, both APIs ensure that data is persisted.
The config fs.azure.disable.outputstream.flush provides an option to make the OutputStream Flush() API a no-op in AbfsOutputStream. By default, this config is set to true.
Because HFlush() is the only documented API that guarantees persistent data transfer, having Flush() also attempt to persist buffered data would lead to performance issues.
The config fs.azure.enable.check.access must be set to true to enable AzureBlobFileSystem.access().
If you set fs.azure.abfs.latency.track to true, the module starts tracking the performance metrics of ABFS HTTP traffic. To obtain these numbers on your machine or cluster, you will also need to enable debug logging for the AbfsPerfTracker class in your log4j config. A typical perf log line appears like:
h=KARMA t=2019-10-25T20:21:14.518Z a=abfstest01.dfs.core.windows.net c=abfs-testcontainer-84828169-6488-4a62-a875-1e674275a29f cr=delete ce=deletePath r=Succeeded l=32 ls=32 lc=1 s=200 e= ci=95121dae-70a8-4187-b067-614091034558 ri=97effdcf-201f-0097-2d71-8bae00000000 ct=0 st=0 rt=0 bs=0 br=0 m=DELETE u=https%3A%2F%2Fabfstest01.dfs.core.windows.net%2Ftestcontainer%2Ftest%3Ftimeout%3D90%26recursive%3Dtrue
The fields have the following definitions:
- h: host name
- t: time when this request was logged
- a: Azure storage account name
- c: container name
- cr: name of the caller method
- ce: name of the callee method
- r: result (Succeeded/Failed)
- l: latency (time spent in callee)
- ls: latency sum (aggregate time spent in caller; logged when there are multiple callees; logged with the last callee)
- lc: latency count (number of callees; logged when there are multiple callees; logged with the last callee)
- s: HTTP Status code
- e: Error code
- ci: client request ID
- ri: server request ID
- ct: connection time in milliseconds
- st: sending time in milliseconds
- rt: receiving time in milliseconds
- bs: bytes sent
- br: bytes received
- m: HTTP method (GET, PUT etc)
- u: Encoded HTTP URL
Note that these performance numbers are also sent back to the ADLS Gen 2 API endpoints in the x-ms-abfs-client-latency HTTP headers in subsequent requests. Azure uses these settings to track their end-to-end latency.
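Because each perf log line is a sequence of space-separated key=value pairs, it is easy to post-process. The following minimal sketch parses the sample line shown above into a dictionary and decodes the URL field; the function name `parse_perf_line` is hypothetical, and the format is assumed to be exactly as in the sample.

```python
from urllib.parse import unquote

SAMPLE_LINE = (
    "h=KARMA t=2019-10-25T20:21:14.518Z a=abfstest01.dfs.core.windows.net "
    "c=abfs-testcontainer-84828169-6488-4a62-a875-1e674275a29f cr=delete "
    "ce=deletePath r=Succeeded l=32 ls=32 lc=1 s=200 e= "
    "ci=95121dae-70a8-4187-b067-614091034558 "
    "ri=97effdcf-201f-0097-2d71-8bae00000000 ct=0 st=0 rt=0 bs=0 br=0 m=DELETE "
    "u=https%3A%2F%2Fabfstest01.dfs.core.windows.net%2Ftestcontainer%2Ftest"
    "%3Ftimeout%3D90%26recursive%3Dtrue"
)


def parse_perf_line(line):
    """Split a perf log line into a dict of field name to string value."""
    # Split on the first '=' only, so empty values such as "e=" are kept.
    fields = dict(token.split("=", 1) for token in line.split())
    # The URL is percent-encoded in the log; decode it for readability.
    fields["u"] = unquote(fields.get("u", ""))
    return fields


if __name__ == "__main__":
    parsed = parse_perf_line(SAMPLE_LINE)
    print(parsed["ce"], parsed["l"], parsed["u"])
```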
The problems associated with the connector usually come down to, in order
If you log org.apache.hadoop.fs.azurebfs.services at DEBUG then you will see more details about any request which is failing.
One useful tool for debugging connectivity is the cloudstore storediag utility.
This validates the classpath, the settings, then tries to work with the filesystem.
bin/hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag abfs://container@account.dfs.core.windows.net/
The hadoop-azure JAR is not on the classpath.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2625)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3290)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3322)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2529)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2623)
    ... 16 more
Tip: if this is happening on the command line, you can turn on debug logging of the hadoop scripts:
export HADOOP_SHELL_SCRIPT_DEBUG=true
If this is happening on an application running within the cluster, it means the cluster (somehow) needs to be configured so that the hadoop-azure module and dependencies are on the classpath of deployed applications.
The azure-storage JAR is not on the classpath.
The request wasn’t authenticated while using the default shared-key authentication mechanism.
Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD, https://account.dfs.core.windows.net/container2?resource=filesystem&timeout=90
    at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:135)
    at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getFilesystemProperties(AbfsClient.java:209)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFilesystemProperties(AzureBlobFileSystemStore.java:259)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.fileSystemExists(AzureBlobFileSystem.java:859)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:110)
Causes include:
There’s no fs.azure.account.key. entry in your cluster configuration declaring the access key for the specific account, or you are using the wrong URL
$ hadoop fs -ls abfs://container@abfswales2.dfs.core.windows.net/
ls: Configuration property abfswales2.dfs.core.windows.net not found.
There is no container of the given name. Either it has been mistyped or the container needs to be created.
$ hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
ls: `abfs://container@abfswales1.dfs.core.windows.net/': No such file or directory
The OAuth authentication page didn’t fail with an HTTP error code, but it didn’t return JSON either
$ bin/hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
...
ls: HTTP Error 200;
  url='https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize'
  AADToken: HTTP connection to
  https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize
  failed for getting token from AzureAD.
  Unexpected response.
  Check configuration, URLs and proxy settings.
  proxies=none;
  requestId='dd9d526c-8b3d-4b3f-a193-0cf021938600';
  contentType='text/html; charset=utf-8';
Likely causes are configuration and networking:
When using Azure Managed Identities, the files/directories in ADLS Gen2 will by default be owned by the service principal object ID (i.e. the principal ID), and submitting jobs as the local OS user ‘user1’ results in the above exception.
The fix is to map the ownership to the local OS user by adding the properties below to core-site.xml.
<property>
  <name>fs.azure.identity.transformer.service.principal.id</name>
  <value>service principal object id</value>
  <description>
  An Azure Active Directory object ID (oid) used as the replacement for names contained
  in the list specified by “fs.azure.identity.transformer.service.principal.substitution.list”.
  Notice that instead of setting oid, you can also set $superuser here.
  </description>
</property>
<property>
  <name>fs.azure.identity.transformer.service.principal.substitution.list</name>
  <value>user1</value>
  <description>
  A comma separated list of names to be replaced with the service principal ID specified by
  “fs.azure.identity.transformer.service.principal.id”. This substitution occurs
  when setOwner, setAcl, modifyAclEntries, or removeAclEntries are invoked with identities
  contained in the substitution list. Notice that when in non-secure cluster, asterisk symbol *
  can be used to match all user/group.
  </description>
</property>
Once the above properties are configured, hdfs dfs -ls abfs://container1@abfswales1.dfs.core.windows.net/ shows the ADLS Gen2 files/directories are now owned by ‘user1’.
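The substitution rule described in the property descriptions can be sketched as a small function. This illustrates the documented behaviour only and is not the connector's implementation; the function name and the sample object ID are hypothetical.

```python
def transform_identity(name, service_principal_id, substitution_list):
    """Map a local name to the service principal ID when it is in the list.

    `substitution_list` is the comma-separated value of
    fs.azure.identity.transformer.service.principal.substitution.list;
    an asterisk matches every user or group, as described above.
    """
    names = [entry.strip() for entry in substitution_list.split(",") if entry.strip()]
    if "*" in names or name in names:
        return service_principal_id
    return name


if __name__ == "__main__":
    # "sp-object-id" is a stand-in for a real Azure AD object ID.
    print(transform_identity("user1", "sp-object-id", "user1"))
    print(transform_identity("user2", "sp-object-id", "user1"))
```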
See the relevant section in Testing Azure.