Apache Hadoop 3.0.0-alpha2 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.


Aliyun OSS is widely used among China’s cloud users, and this work implements a new Hadoop-compatible filesystem, AliyunOSSFileSystem, with the oss scheme, similar to the existing s3a and azure support.


Log InvalidTokenException at trace level in DataXceiver#run().


Users:

In Apache Hadoop 3.0.0-alpha1, verification required environment variables of the form HADOOP_(subcommand)_USER, where the subcommand was lowercase and the setting applied globally. The format is now (command)_(subcommand)_USER, with all parts uppercase, to be consistent with the _OPTS functionality and to allow per-command options. Additionally, the check now happens sooner, which should make failures surface faster.

Developers:

This changes hadoop_verify_user to require the program’s name as part of the function call. This is incompatible with Apache Hadoop 3.0.0-alpha1.
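As an illustration of the new naming scheme, per-command user variables might be set in hadoop-env.sh along these lines (the user names shown are examples, not defaults):

```shell
# hadoop-env.sh: per-command _USER variables under the new
# (command)_(subcommand)_USER naming scheme. The user names
# below are illustrative only.
export HDFS_NAMENODE_USER=hdfs
export HDFS_DATANODE_USER=root
export YARN_RESOURCEMANAGER_USER=yarn
```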


Introduces a new configuration property, yarn.resourcemanager.amlauncher.log.command. If this property is set to true, the AM command being launched will be logged in the RM log; otherwise it is masked.


The original implementation of HDFS ACLs applied the client’s umask to the permissions when inheriting a default ACL defined on a parent directory. This behavior is a deviation from the POSIX ACL specification, which states that the umask has no influence when a default ACL propagates from parent to child. HDFS now offers the capability to ignore the umask in this case for improved compliance with POSIX. This change is considered backward-incompatible, so the new behavior is off by default and must be explicitly configured by setting dfs.namenode.posix.acl.inheritance.enabled to true in hdfs-site.xml. Please see the HDFS Permissions Guide for further details.
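To opt in to the POSIX-style inheritance behavior described above, the property is set in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: enable POSIX-style default ACL inheritance.
     Off by default because the change is backward-incompatible. -->
<property>
  <name>dfs.namenode.posix.acl.inheritance.enabled</name>
  <value>true</value>
</property>
```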


Users:

* Ability to set per-command+sub-command options from the command line.
* Makes daemon environment variable options consistent across the project. (See deprecation list below.)
* HADOOP_CLIENT_OPTS is now honored for every non-daemon sub-command. Prior to this change, many sub-commands did not use it.

Developers:

* No longer any need for custom handling of options in the case section of the shell scripts.
* Consolidates all _OPTS handling into hadoop-functions.sh to enable future projects.
* All daemons running with secure mode features now get _SECURE_EXTRA_OPTS support.

_OPTS Changes:

Old                            New
HADOOP_BALANCER_OPTS           HDFS_BALANCER_OPTS
HADOOP_DATANODE_OPTS           HDFS_DATANODE_OPTS
HADOOP_DN_SECURE_EXTRA_OPTS    HDFS_DATANODE_SECURE_EXTRA_OPTS
HADOOP_JOB_HISTORYSERVER_OPTS  MAPRED_HISTORYSERVER_OPTS
HADOOP_JOURNALNODE_OPTS        HDFS_JOURNALNODE_OPTS
HADOOP_MOVER_OPTS              HDFS_MOVER_OPTS
HADOOP_NAMENODE_OPTS           HDFS_NAMENODE_OPTS
HADOOP_NFS3_OPTS               HDFS_NFS3_OPTS
HADOOP_NFS3_SECURE_EXTRA_OPTS  HDFS_NFS3_SECURE_EXTRA_OPTS
HADOOP_PORTMAP_OPTS            HDFS_PORTMAP_OPTS
HADOOP_SECONDARYNAMENODE_OPTS  HDFS_SECONDARYNAMENODE_OPTS
HADOOP_ZKFC_OPTS               HDFS_ZKFC_OPTS
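As an illustration of the renames above, a hadoop-env.sh entry would be updated as follows (the heap sizes shown are examples only):

```shell
# hadoop-env.sh
# Before (Hadoop 2.x / 3.0.0-alpha1):
#   export HADOOP_NAMENODE_OPTS="-Xmx4g"
# After (3.0.0-alpha2 and later):
export HDFS_NAMENODE_OPTS="-Xmx4g"
export HDFS_DATANODE_OPTS="-Xmx1g"
```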

The Conf HTTP service now sets the response’s content type according to the Accept header in the request.


WARNING: No release note provided for this change.


WARNING: No release note provided for this change.


The configuration dfs.encryption.key.provider.uri is deprecated. To configure the key provider in HDFS, please use hadoop.security.key.provider.path.
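For example, a key provider configuration would now use the new property name (the KMS address below is a placeholder):

```xml
<!-- core-site.xml: replaces the deprecated dfs.encryption.key.provider.uri.
     The KMS host and port below are placeholders. -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@kms.example.com:9600/kms</value>
</property>
```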


A new protobuf field added to RemoteEditLogManifest was mistakenly marked as required. This changes the field to optional, preserving compatibility with 2.x releases but breaking compatibility with 3.0.0-alpha1.


The remaining classes in the org.apache.hadoop.hdfs.client package have been moved from hadoop-hdfs to hadoop-hdfs-client.


Changed Apache Kafka dependency from kafka-2.10 to kafka-clients in hadoop-kafka module.


S3A now provides a working implementation of the FileSystem#createNonRecursive method.


If pipeline recovery fails due to an expired encryption key, the client now attempts to refresh the key and retries.


Jackson 1.9.13 dependency was removed from hadoop-tools module.


The default value of yarn.app.mapreduce.client.job.max-retries has been changed from 0 to 3. This helps protect clients from transient failures. True failures may take slightly longer now due to the retries.


Disk usage summaries previously incorrectly counted files twice if they had been renamed (including files moved to Trash) since being snapshotted. Summaries now include current data plus snapshotted data that is no longer under the directory either due to deletion or being moved outside of the directory.


This changes the config var cycle detection introduced in 3.0.0-alpha1 by HADOOP-6871 such that it detects single-variable but not multi-variable loops. This also fixes resolution of multiple specifications of the same variable in a config value.


WARNING: No release note provided for this change.


EC policy is now stored in the “system” extended attribute namespace rather than “raw”. This means the EC policy extended attribute is no longer directly accessible by users or preserved across a distcp that preserves raw extended attributes.

Users can instead use HdfsAdmin#setErasureCodingPolicy and HdfsAdmin#getErasureCodingPolicy to set and get the EC policy for a path.


The maximum number of applications the RM stores in memory and in the state-store by default has been lowered from 10,000 to 1,000. This should ease the pressure on the state-store. However, installations relying on the previous default of 10,000 are affected.


If the root path / is an encryption zone, DistributedFileSystem#getTrashRoot(new Path(“/”)) previously returned /user/$USER/.Trash, which was wrong; the correct value is /.Trash/$USER.


The unused method getTrashCanLocation has been removed. This method has long been superseded by FileSystem#getTrashRoot.


Bump HTrace version from 4.0.1-incubating to 4.1.0-incubating.


The BookkeeperJournalManager implementation has been removed. Users are encouraged to use QuorumJournalManager instead.


Permissions have been added to the fs -stat command. They are available in symbolic (%A) and octal (%a) formats, in line with Linux.


WARNING: No release note provided for this change.


This mechanism replaces the (experimental) fast output stream of Hadoop 2.7.x, combining better scalability options with instrumentation. Consult the S3A documentation for the extra configuration options.


An unnecessary dependency on hadoop-mapreduce-client-shuffle in hadoop-mapreduce-client-jobclient has been removed.


The FileSystem#listStatus contract has been changed to never return null. Local filesystems prior to 3.0.0 returned null upon an access error; this is considered erroneous. FileSystem#listStatus now throws an IOException upon access error.


kms-audit.log used to show an UNAUTHENTICATED message even for successful operations, because of the OPTIONS HTTP request during SPNEGO initial handshake. This message brings more confusion than help, and has hence been removed.


Improves the error message when the DataNode removes a replica that is not found.


Fsck now reports whether a file is replicated or erasure-coded. If it is replicated, fsck reports the replication factor of the file; if it is erasure-coded, fsck reports the erasure coding policy of the file.


Fixed a bug that caused fsck -list-corruptfileblocks to count corrupt erasure-coded files incorrectly.


DockerContainerExecutor was deprecated in 2.9.0 and is removed in 3.0.0. Please use LinuxContainerExecutor with the DockerRuntime to run Docker containers on YARN clusters.


This provides a native implementation of the XOR codec by leveraging the Intel ISA-L library to achieve better performance.


Bump the version of third party dependency jaxb-api to 2.2.11.


Interface classes have been changed to abstract classes to maintain consistency with all other protos.


This issue fixes a bug in how resources are evicted from the PUBLIC and PRIVATE yarn local caches used by the node manager for resource localization. In summary, the caches are now properly cleaned based on an LRU policy across both the public and private caches.


HDFS audit logs are formatted as individual lines, each of which contains several key-value pair fields. Some of the values come from the client request (e.g. src, dst). Before this patch, control characters such as \t and \n were not escaped in audit logs, which could break lines unexpectedly or introduce additional tab characters within a field (in the worst case, both), so tools that parse audit logs had to handle this carefully. After this patch, the control characters in the src/dst fields are escaped.


Hadoop’s javadoc jars should be significantly smaller, and contain only javadoc.

As a related cleanup, the dummy hadoop-dist-* jars are no longer generated as part of the build.


FileSystem#getDefaultUri now throws an IllegalArgumentException if the default FS has no scheme and the scheme cannot be fixed up.


getTrashRoot returns the trash root for a path. Currently in DFS, if the path /foo is a normal path it returns /user/$USER/.Trash, and if /foo is an encryption zone it returns /foo/.Trash/$USER for child files/dirs of /foo. This patch overrides getTrashRoot in httpfs and webhdfs so that their trash root behavior is consistent with DFS.


Removed jackson 1.9.13 dependency from hadoop-hdfs-project module.


This change reverts YARN-5567 from 3.0.0-alpha1. The exit codes of the health check script are once again ignored.


Strict validation is now performed on mandatory parameters of WebHDFS REST requests.


ViewFileSystem#getServerDefaults(Path) now throws NotInMountException instead of FileNotFoundException for an unmounted path.


The hadoop fs -ls command now prints “Permission denied” rather than “No such file or directory” when the user doesn’t have permission to traverse the path.


The last partial chunk checksum is now properly loaded into memory when converting a finalized/temporary replica to an rbw replica. This ensures a concurrent reader reads the correct checksum that matches the data before the update.


This change reverts YARN-5287 from 3.0.0-alpha1, under which chmod cleared the set-group-ID bit and folder permissions were consequently being reset.


Bump commons-configuration version from 1.6 to 2.1.


We are sorry for causing pain for everyone for whom this Jackson update causes problems, but it was proving impossible to stay on the older version: too much code had moved past it, staying back limited what Hadoop could do, and it gave everyone who wanted an up-to-date version of Jackson a different set of problems. We selected Jackson 2.7.8 because it fixes a security issue in XML parsing, yet proved compatible at the API level with the Hadoop codebase, and hopefully with everything downstream.


Jackson 1.9.13 dependency was removed from hadoop-yarn-project.


The dependency on the AWS SDK has been bumped to 1.11.45.


The default sync interval within new SequenceFile writes is now 100KB, up from the older default of 2000B. The sync interval is now also manually configurable via the SequenceFile.Writer API.


This introduces a new erasure coding policy named XOR-2-1-64k using the simple XOR codec. It can be used to evaluate the HDFS erasure coding feature in a small cluster (only 2 + 1 datanodes needed). The policy is not recommended for use in a production cluster.


As part of this patch, the Google Test framework code was updated to v1.8.0.


Removed Jackson 1.9.13 dependency from hadoop-common module.


Tomcat 6.0.46 starts to filter weak ciphers. Some old SSL clients may be affected. It is recommended to upgrade the SSL client. Run the SSL client against https://www.howsmyssl.com/a/check to find out its TLS version and cipher suites.


The default value of “dfs.namenode.fs-limits.max-blocks-per-file” has been reduced from 1M to 10K.


A reencryptEncryptedKey interface has been added to the KMS to re-encrypt an encrypted key with the latest version of the encryption key.


Jackson 1.9.13 dependency was removed from hadoop-maven-plugins module.


The hadoop-mapreduce-client-core module now creates and distributes a test jar.


The DataNode and NameNode MXBean interfaces have been marked as Private and Stable to indicate that although users should not be implementing these interfaces directly, the information exposed by these interfaces is part of the HDFS public API.


The fix for HDFS-11056 reads the meta file to load the last partial chunk checksum when a block is converted from finalized/temporary to rbw. However, it did not close the file explicitly, which could cause the number of open files to reach the system limit. This jira fixes it by closing the file explicitly after the meta file is read.


The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.


Fixed a race condition that caused VolumeScanner to recognize a good replica as a bad one if the replica is also being written concurrently.


The following environment variables are deprecated. Set the corresponding configuration properties instead.

Environment Variable      Configuration Property                                                        Configuration File
KMS_HTTP_PORT             hadoop.kms.http.port                                                          kms-site.xml
KMS_MAX_HTTP_HEADER_SIZE  hadoop.http.max.request.header.size and hadoop.http.max.response.header.size  kms-site.xml
KMS_MAX_THREADS           hadoop.http.max.threads                                                       kms-site.xml
KMS_SSL_ENABLED           hadoop.kms.ssl.enabled                                                        kms-site.xml
KMS_SSL_KEYSTORE_FILE     ssl.server.keystore.location                                                  ssl-server.xml
KMS_SSL_KEYSTORE_PASS     ssl.server.keystore.password                                                  ssl-server.xml
KMS_TEMP                  hadoop.http.temp.dir                                                          kms-site.xml
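For example, a KMS previously configured with the KMS_HTTP_PORT environment variable would now set the port in kms-site.xml (the port value shown is illustrative):

```xml
<!-- kms-site.xml: replaces the deprecated KMS_HTTP_PORT environment variable. -->
<property>
  <name>hadoop.kms.http.port</name>
  <value>9600</value>
</property>
```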

The following default HTTP services have been added.

Name                Description
/conf               Display configuration properties
/jmx                Java JMX management interface
/logLevel           Get or set log level per class
/logs               Display log files
/stacks             Display JVM stacks
/static/index.html  The static home page

The JMX path has been changed from /kms/jmx to /jmx.

The kms.sh script has been deprecated; use hadoop kms instead. The new scripts are based on the Hadoop shell scripting framework. hadoop daemonlog is supported. SSL configurations are read from ssl-server.xml.


Added two configuration keys, fs.ftp.data.connection.mode and fs.ftp.transfer.mode, to configure the FTP data connection mode and transfer mode, respectively.
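A sketch of the new keys in core-site.xml; the mode names below assume they mirror the commons-net FTP constants, so check the filesystem documentation for the supported values:

```xml
<!-- core-site.xml: FTP data connection and transfer modes.
     The values shown assume commons-net-style mode names. -->
<property>
  <name>fs.ftp.data.connection.mode</name>
  <value>PASSIVE_LOCAL_DATA_CONNECTION_MODE</value>
</property>
<property>
  <name>fs.ftp.transfer.mode</name>
  <value>BLOCK_TRANSFER_MODE</value>
</property>
```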


WARNING: No release note provided for this change.


Apache Hadoop is now able to switch to the appropriate user prior to launching commands, so long as the command is being run by a privileged user and the appropriate set of _USER variables is defined. This re-enables sbin/start-all.sh and sbin/stop-all.sh, and fixes sbin/start-dfs.sh and sbin/stop-dfs.sh to work on both secure and insecure systems.


This patch removes share/hadoop/{hadoop,hdfs,mapred,yarn}/templates directories and contents.


This is a workaround to avoid a dependency conflict with Spark 2 until a full classpath isolation solution is implemented: skip instantiating a Timeline Service client if a NoClassDefFoundError is encountered.


Hadoop now supports integration with Azure Data Lake as an alternative Hadoop-compatible file system. Please refer to the Hadoop site documentation of Azure Data Lake for details on usage and configuration.


With this JIRA we are introducing distributed scheduling in YARN. In particular, we make the following contributions:

* Introduce the notion of container types. GUARANTEED containers follow the semantics of the existing YARN containers. OPPORTUNISTIC ones can be seen as lower priority containers, and can be preempted in order to make space for GUARANTEED containers to run.
* Queuing of tasks at the NMs. This enables us to send more containers to an NM than its available resources. At the moment we are allowing queuing of OPPORTUNISTIC containers. Once resources become available at the NM, such containers can immediately start their execution.
* Introduce the AMRMProxy. This is a service running at each node, intercepting the requests between the AM and the RM. It is instrumental for both distributed scheduling and YARN Federation (YARN-2915).
* Enable distributed scheduling. To minimize their allocation latency, OPPORTUNISTIC containers are dispatched immediately to NMs in a distributed fashion by using the AMRMProxy of the node where the corresponding AM resides, without needing to go through the ResourceManager.

All the functionality introduced in this JIRA is disabled by default, so it will not affect the behavior of existing applications. We have introduced parameters in YarnConfiguration to enable NM queuing (yarn.nodemanager.container-queuing-enabled), distributed scheduling (yarn.distributed-scheduling.enabled) and the AMRMProxy service (yarn.nodemanager.amrmproxy.enable). AMs currently need to specify the type of container to be requested for each task. We are in the process of adding in the MapReduce AM the ability to randomly request OPPORTUNISTIC containers for a specified percentage of a job’s tasks, so that users can experiment with the new features.
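Using the property names listed above, an experimental setup might enable the features in yarn-site.xml as follows (all three are disabled by default):

```xml
<!-- yarn-site.xml: opt in to the default-off distributed scheduling features. -->
<property>
  <name>yarn.nodemanager.container-queuing-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.distributed-scheduling.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.amrmproxy.enable</name>
  <value>true</value>
</property>
```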