Apache Hadoop 3.0.0-alpha4 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

The hadoop-azure-datalake file system now supports configuration of the Azure Data Lake Store account credentials using the standard Hadoop Credential Provider API. For details, please refer to the documentation on hadoop-azure-datalake and the Credential Provider API.

Add a new configuration - “yarn.app.mapreduce.am.webapp.port-range” to specify port-range for webapp launched by AM.

The following environment variables are deprecated. Set the corresponding configuration properties instead.

Environment Variable Configuration Property Configuration File
HTTPFS_TEMP hadoop.http.temp.dir httpfs-site.xml
HTTPFS_HTTP_PORT hadoop.httpfs.http.port httpfs-site.xml
HTTPFS_MAX_HTTP_HEADER_SIZE hadoop.http.max.request.header.size and hadoop.http.max.response.header.size httpfs-site.xml
HTTPFS_MAX_THREADS hadoop.http.max.threads httpfs-site.xml
HTTPFS_SSL_ENABLED hadoop.httpfs.ssl.enabled httpfs-site.xml
HTTPFS_SSL_KEYSTORE_FILE ssl.server.keystore.location ssl-server.xml
HTTPFS_SSL_KEYSTORE_PASS ssl.server.keystore.password ssl-server.xml

These default HTTP Services have been added.

Name Description
/conf Display configuration properties
/jmx Java JMX management interface
/logLevel Get or set log level per class
/logs Display log files
/stacks Display JVM stacks
/static/index.html The static home page

Script httpfs.sh has been deprecated, use hdfs httpfs instead. The new scripts are based on the Hadoop shell scripting framework. hadoop daemonlog is supported. SSL configurations are read from ssl-server.xml.

An invalidateCache command has been added to the KMS. The rollNewVersion semantics of the KMS has been improved so that after a key’s version is rolled, generateEncryptedKey of that key guarantees to return the EncryptedKeyVersion based on the new key version.

The new encryption options SSE-KMS and especially SSE-C must be considered experimental at present. If you are using SSE-C, problems may arise if the bucket mixes encrypted and unencrypted files. For SSE-KMS, there may be extra throttling of IO, especially with the fadvise=random option. You may wish to request an increase in your KMS IOPs limits.

Changed the serialized format of BlockTokenIdentifier to protocol buffers. Includes logic to decode both the old Writable format and the new PB format to support existing clients. Client implementations in other languages will require similar functionality.

To run live unit tests, create src/test/resources/auth-keys.xml with the same properties as in the deprecated contract-test-options.xml.

Changed the behavior of removing directories with sticky bits, so that it is closer to what most Unix/Linux users would expect.

Let yarn client exit with an informative error message if an incompatible Jersey library is used from client side.

Due to a remaining issue after HADOOP-13558, an UGI may still try to renew the TGT even though the UGI is created from an existing Subject. The renewal would fail because of non-existing keytab.

Fixing the issue means different behavior which is incompatible, however, configuration property “hadoop.treat.subject.external” is introduced to enable the fix (disabled by default). The behavior is the same as before when the fix is not enabled.

The “hdfs erasurecode” CLI command has been renamed to “hdfs ec” for ease-of-use.

The `hdfs ec` CLI command has been substantially reworked to make the calling patterns more similar to the `hdfs storagepolicies` command. See `hdfs ec -help` and the HDFS erasure coding documentation for more information.

A new introduced configuration key “hadoop.security.groups.shell.command.timeout” allows applying a finite wait timeout over the ‘id’ commands launched by the ShellBasedUnixGroupsMapping plugin. Values specified can be in any valid time duration units: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html#getTimeDuration-java.lang.String-long-java.util.concurrent.TimeUnit-

Value defaults to 0, indicating infinite wait (preserving existing behaviour).

The “rs-default” codec has been renamed to simply “rs” for simplicity. Previous configuration keys like “io.erasurecode.codec.rs-default” have also been renamed to match.

The FSImage on-disk format for INodeFile is changed to additionally include a field for Erasure Coded files. This optional field ‘erasureCodingPolicyID’ which is unit32 type is available for all Erasure Coded files and represents the Erasure Coding Policy ID. Previously, the ‘replication’ field in INodeFile disk format was overloaded to represent the same Erasure Coding Policy ID.

{{HdfsAdmin#setErasureCodingPolicy}} now takes a String {{ecPolicyName}} rather than an ErasureCodingPolicy object. The corresponding RPC’s wire format has also been modified.

The classpath implementing the s3a filesystem is now defined in core-default.xml. Attempting to instantiate an S3A filesystem instance using a Configuration instance which has not included the default resorts will fail. Applications should not be doing this anyway, as it will lose other critical configuration options needed by the filesystem.

Two new configuration keys, seq.io.sort.mb and seq.io.sort.factor have been introduced for the SequenceFile’s Sorter feature to replace older, deprecated property keys of io.sort.mb and io.sort.factor.

This only affects direct users of the org.apache.hadoop.io.SequenceFile.Sorter Java class. For controlling MR2’s internal sorting instead, use the existing config keys of mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor.

The HdfsAdmin erasure coding APIs (set, unset, get) are now usable by non-superusers based on appropriate file and directory permissions.

This JIRA sets the Netty 4 dependency to 4.0.23. This is an incompatible change for the 3.0 release line, as 3.0.0-alpha1 and 3.0.0-alpha2 depended on Netty 4.1.0.Beta5.

The NameNode metadata for storing erasure coding policies has changed.

HDFS will now restrict the set of erasure coding policies that can be set by users. The set of allowed policies can be configured via “dfs.namenode.ec.policies.enabled” on the NameNode. Please see the documentation for more details.

Allow a block to complete if the number of replicas on live nodes, decommissioning nodes and nodes in maintenance mode satisfies minimum replication factor. The fix prevents block recovery failure if replica of last block is being decommissioned. Vice versa, the decommissioning will be stuck, waiting for the last block to be completed. In addition, file close() operation will not fail due to last block being decommissioned.

By default, none of the built-in erasure coding policies are enabled. Users have to explicitly enable the erasure coding policy via the hdfs configuration ‘dfs.namenode.ec.policies.enabled’ before setting the policy on any directories.

Move the check for hadoop-site.xml to static initialization of the Configuration class.

Guava is updated to version 21.0.

In the background of merging this patch into trunk, there is a work, shaded Hadoop client artifacts and minicluster, on HADOOP-11804. hadoop-client has its own Guava which is shaded, so we can update dependency with minimum effect compare to previous HADOOP-11804.

See also HADOOP-14238 as related problem.

DistCpOptions has been changed to be constructed with a Builder pattern. This potentially affects applications that invoke DistCp with the Java API.

The scope of hadoop-hdfs’s dependency on hadoop-hdfs-client has changed from “compile” to “provided”. This may affect users who directly consume hadoop-hdfs, which is a private API. These users need to add a new dependency on hadoop-hdfs-client, or better yet, switch from hadoop-hdfs to hadoop-hdfs-client.

The secure user variables have been changed to be consistent with the rest of the environment variable changes:

Old New

Switch the default ADLS access token provider type from Custom to ClientCredential.

Metric preemptCall in FSOpDurations is no longer supported.

Minimum version of Apache Maven has been updated from 3.0 to 3.3.

xmlenc dependency has been removed. If you rely on the transitive dependency, you need to set the dependency explicitly in your code after this change.

Use configuration properties io.erasurecode.codec.{rs-legacy,rs,xor}.rawcoders to control erasure coding codec. These properties support codec fallback in case the previous codec is not loaded.

SharedInstanceProfileCredentialsProvider is removed after this change. Users should use InstanceProfileCredentialsProvider provided by AWS SDK instead, which itself enforces a singleton instance to reduce calls to AWS EC2 Instance Metadata Service.

Some of the existing fields in ErasureCodingPolicyProto have changed from required to optional. For system EC policies, these fields are populated from hardcoded values.

If a positive value is passed to command line switch -blocksperchunk, files with more blocks than this value will be split into chunks of `<blocksperchunk>` blocks to be transferred in parallel, and reassembled on the destination. By default, `<blocksperchunk>` is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when both the source file system supports getBlockLocations and target supports concat.

The deprecated ProcessTree methods getCumulativeVmem and getCumulativeRssmem have been removed.

When the config param “dfs.namenode.snapshot.capture.openfiles” is enabled, HDFS snapshots taken will additionally capture point-in-time copies of the open files that have valid leases. Even when the current version open files grow or shrink in size, the snapshot will always retain the immutable versions of these open files, just as in for all other closed files. Note: The file length captured for open files in the snapshot was the one recorded in NameNode at the time of snapshot and it may be shorter than what the client has written till then. In order to capture the latest length, the client can call hflush/hsync with the flag SyncFlag.UPDATE_LENGTH on the open files handles.

StorageTypes are now encoded in the BlockTokenIdentifier to ensure that the intended StorageType for writes is not tampered with on it’s way through the Client to the Datanode.

Apache Httpclient has been removed as a dependency. This library is End of Life: people using it should move to its {{httpcore}} successor. If you cannot do that, you must add an explicit dependency on {{httpclient}} in your classpath.

CodecRegistry uses ServiceLoader to dynamically load all implementations of RawErasureCoderFactory. In Hadoop 3.0, there are several built-in implementations, and user can also provide self-defined implementations with the corresponding resource files. For each codec, user can configure the order of the implementations with the configuration keys: `io.erasurecode.codec.rs.rawcoders` for the default RS codec, `io.erasurecode.codec.rs-legacy.rawcoders` for the legacy RS codec, `io.erasurecode.codec.xor.rawcoders` for the XOR codec. User can also configure self-defined codec with the configuration key like: `io.erasurecode.codec.self-defined.rawcoders`. For each codec, Hadoop will use the implementation according to the order configured. If the former implementation fails, it will fall back to call the latter one. The order is defined by a list of coder names separated by commas. The names for the built-in implementations are: `rs_native` and `rs_java` for the default RS codec, of which the former is a native implementation which leverages Intel ISA-L library, which is the default implementation and the latter is the implementation in pure Java, `rs-legacy_java` for the legacy RS codec, which is the default implementation in pure Java, `xor_native` and `xor_java` for the XOR codec, of which the former is the Intel ISA-L implementation which is the default one and the latter in pure Java.

WARNING: No release note provided for this change.

YARN application tags can no longer contain non-printable ASCII characters.

hadoop-auth and hadoop-hdfs-httpfs modules no longer generate dependencies.html via maven-project-info-reports-plugin.

This change removes the support in the shell scripts for Tomcat that was added in 3.0.0-alpha1.

Findbugs report is no longer part of the documentation.

Reverted HDFS-10797 to fix a scalability regression brought by the commit.

WARNING: No release note provided for this change.

The copy buffer size can be configured via the new parameter <copybuffersize>. By default the <copybuffersize> is set to 8KB.

Changes the type of JobConf.DEFAULT_LOG_LEVEL from a Log4J Level to a String. Clients that referenced this field will need to be recompiled and may need to alter their source to account for the type change. The level itself remains conceptually at “INFO”.

If -p option of distcp command is unspecified, block size is preserved.

Remove the BlockReport(NumOps,AvgTime) metrics emitted under the NameNodeActivity context in favor of StorageBlockReport(NumOps,AvgTime) which more accurately represent the metric. Same for the corresponding quantile metrics.

DistributedFileSystem#listStatusIterator(..) throws FileNotFoundException if directory got deleted during iterating over large list beyond ls limit.

This breaks rolling upgrades because it changes the major version of the NM state store schema. Therefore when a new NM comes up on an old state store it crashes.

The state store versions for this change have been updated in YARN-6798.

Hadoop 2.x clients do not pass the storage ID or target storage IDs when writing a block. For backwards compatibility, the DataNode will not require the presence of these fields. This means older clients are unable to write to a particular storage as chosen by the NameNode (e.g. HDFS-9806).

The WASB FileSystem now uses version 5.3.0 of the Azure Storage SDK.

Fix to wasb:// (Azure) file system that allows the concurrent I/O feature to be used with the secure mode feature.

ResourceManager will now record ResourceRequests from different attempts into different objects.