Apache Hadoop 1.2.0 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

Backport cache-aware improvements for PureJavaCrc32 from trunk (HADOOP-8926)

A new 4-layer network topology, NetworkTopologyWithNodeGroup, is available to make Hadoop more robust and efficient in virtualized environments.

This patch should be checked in together with, or after, HADOOP-8469: https://issues.apache.org/jira/browse/HADOOP-8469

This jira allows paths that use the backslash as a separator only on Windows. On *nix systems the backslash is used as an escape character. Support for paths using the backslash as a path separator will be removed by HADOOP-8139 in release 23.3.

The jsvc build target is now supported for Mac OS X and other platforms as well.

With this improvement, the following options are available in release 1.2.0 and later releases on the 1.x stream:
1. The jsvc location can be overridden by setting the environment variable JSVC_HOME. Defaults to the jsvc binary packaged within the Hadoop distro.
2. jsvc log output is directed to the file defined by JSVC_OUTFILE. Defaults to $HADOOP_LOG_DIR/jsvc.out.
3. jsvc error output is directed to the file defined by JSVC_ERRFILE. Defaults to $HADOOP_LOG_DIR/jsvc.err.

With this improvement, the following options are available in release 2.0.4 and later releases on the 2.x stream:
1. jsvc log output is directed to the file defined by JSVC_OUTFILE. Defaults to $HADOOP_LOG_DIR/jsvc.out.
2. jsvc error output is directed to the file defined by JSVC_ERRFILE. Defaults to $HADOOP_LOG_DIR/jsvc.err.

To override the jsvc location on 2.x releases, see the release note from HDFS-2303: To run secure Datanodes, users must install jsvc for their platform and set JSVC_HOME to point to the location of jsvc in their environment.
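The defaulting rules for JSVC_OUTFILE and JSVC_ERRFILE described above can be sketched in a few lines of Java. This is an illustration only, not Hadoop code: the class JsvcSettings and its methods are hypothetical names, the environment is modelled as a plain map, and the fallback log directory used in main is an assumed value for the demo.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical helper, not part of Hadoop.
class JsvcSettings {
    // Returns the variable's value, or the fallback when it is unset or empty.
    static String envOrDefault(Map<String, String> env, String name, String fallback) {
        String value = env.get(name);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    // JSVC_OUTFILE defaults to $HADOOP_LOG_DIR/jsvc.out.
    static String outFile(Map<String, String> env) {
        return envOrDefault(env, "JSVC_OUTFILE", env.get("HADOOP_LOG_DIR") + "/jsvc.out");
    }

    // JSVC_ERRFILE defaults to $HADOOP_LOG_DIR/jsvc.err.
    static String errFile(Map<String, String> env) {
        return envOrDefault(env, "JSVC_ERRFILE", env.get("HADOOP_LOG_DIR") + "/jsvc.err");
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>(System.getenv());
        // Assumed value for the demo when HADOOP_LOG_DIR is not set.
        env.putIfAbsent("HADOOP_LOG_DIR", "/var/log/hadoop");
        System.out.println("jsvc output file: " + outFile(env));
        System.out.println("jsvc error file:  " + errFile(env));
    }
}
```

Setting JSVC_OUTFILE or JSVC_ERRFILE in the environment before running the sketch changes the corresponding printed path; leaving them unset yields the defaults under the log directory.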

This patch makes an incompatible configuration change, as described below: In release 1.1.0 and the 1.1.x point releases, the configuration parameter “dfs.namenode.check.stale.datanode” could be used to turn on checking for stale nodes. This parameter is no longer supported from release 1.2.0 onwards; it has been renamed to “dfs.namenode.avoid.read.stale.datanode”.

How the feature works and how to configure it: As described in the HDFS-3703 release notes, the datanode stale period can be configured using the parameter “dfs.namenode.stale.datanode.interval” in seconds (default value is 30 seconds). The NameNode can be configured to use this staleness information for reads using the configuration “dfs.namenode.avoid.read.stale.datanode”. When this parameter is set to true, the namenode picks a stale datanode as the last target to read from when returning block locations for reads. Using staleness information for writes is described in the release notes of HDFS-3912.
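The read-side behaviour can be pictured as a stable reordering of the block locations so that stale datanodes come last. The following is a minimal sketch under stated assumptions, not the NameNode implementation: StaleNodeOrdering, Node, isStale and orderForRead are hypothetical names, and the timestamps and the stale interval are simply expressed in the same arbitrary time unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: hypothetical model, not the NameNode's own code.
class StaleNodeOrdering {
    static class Node {
        final String name;
        final long lastHeartbeat; // time of the last heartbeat received

        Node(String name, long lastHeartbeat) {
            this.name = name;
            this.lastHeartbeat = lastHeartbeat;
        }
    }

    // A node is stale when no heartbeat arrived within the stale interval.
    static boolean isStale(Node node, long now, long staleInterval) {
        return now - node.lastHeartbeat > staleInterval;
    }

    // Returns a copy of the locations with stale nodes moved to the end.
    // List.sort is stable, so the other nodes keep their relative order.
    static List<Node> orderForRead(List<Node> locations, long now, long staleInterval) {
        List<Node> ordered = new ArrayList<>(locations);
        ordered.sort(Comparator.comparing((Node n) -> isStale(n, now, staleInterval)));
        return ordered;
    }

    public static void main(String[] args) {
        List<Node> locations = Arrays.asList(
                new Node("dn1", 10), new Node("dn2", 95), new Node("dn3", 90));
        for (Node n : orderForRead(locations, 100, 30)) {
            System.out.println(n.name + (isStale(n, 100, 30) ? " (stale)" : ""));
        }
    }
}
```

In this run dn1 last reported 90 time units ago, which exceeds the interval of 30, so it is listed after dn2 and dn3.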

Backport HDFS-4240 to branch-1

The namenode RPC address is currently identified from the configuration “fs.default.name”. In some setups where the default FS is other than HDFS, “fs.default.name” cannot be used to get the namenode address. When such a setup co-exists with HDFS, with this change the namenode can be identified using a separate configuration parameter, “dfs.namenode.rpc-address”.

“dfs.namenode.rpc-address”, when configured, overrides “fs.default.name” for identifying the namenode RPC address.
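The precedence between the two parameters can be sketched as follows. This is an illustration under stated assumptions rather than Hadoop's own lookup code: the configuration is modelled as a plain map, NameNodeAddressResolver and resolve are hypothetical names, and the host names in main are placeholders.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical helper, not part of Hadoop.
class NameNodeAddressResolver {
    // Prefers dfs.namenode.rpc-address; otherwise falls back to the
    // host:port part of the fs.default.name URI.
    static String resolve(Map<String, String> conf) {
        String rpcAddress = conf.get("dfs.namenode.rpc-address");
        if (rpcAddress != null && !rpcAddress.isEmpty()) {
            return rpcAddress;
        }
        String defaultFs = conf.get("fs.default.name");
        if (defaultFs == null) {
            throw new IllegalStateException("no namenode address configured");
        }
        return URI.create(defaultFs).getAuthority();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Placeholder values: the default FS is not HDFS in this setup.
        conf.put("fs.default.name", "file:///");
        conf.put("dfs.namenode.rpc-address", "nn1.example.com:8020");
        System.out.println(resolve(conf));
    }
}
```

With only “fs.default.name” set to an hdfs:// URI, the sketch returns that URI's host and port; once “dfs.namenode.rpc-address” is present, its value is returned regardless of the default FS.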

This jira changes the content of some of the log messages. No log messages are removed; only their content is changed, to reduce their size. If you have a tool that depends on the exact content of the log, please look at the patch and make appropriate updates to the tool.

This jira adds a new metric named “StaleDataNodes”, of type Gauge, under the metrics context “dfs”. It tracks the number of DataNodes marked as stale. A DataNode is marked stale when the heartbeat message from the DataNode is not received within the configured time “dfs.namenode.stale.datanode.interval”.

Please see the hdfs-default.xml documentation for “dfs.namenode.stale.datanode.interval” for more details on how to configure this feature. When this feature is not configured, this metric returns zero.

The datanode now performs 4MB readahead by default when reading data from its disks, if the native libraries are present. This has been shown to improve performance in many workloads. The feature may be disabled by setting dfs.datanode.readahead.bytes to “0”.

New experimental API BlockPlacementPolicy allows investigating alternate rules for locating block replicas.

Ensures that the mapreduce APIs are semantically consistent with the mapred API with respect to Mapper.cleanup and Reducer.cleanup, in the sense that cleanup is now called even if there is an error. The old mapred API already ensures that Mapper.close and Reducer.close are invoked during error handling. Note that this is an incompatible change; however, end users can override Mapper.run and Reducer.run to get the old (inconsistent) behaviour.
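The difference between the new and the old behaviour comes down to whether cleanup sits in a finally block. The sketch below shows this with a simplified stand-in for a mapper; it is not org.apache.hadoop.mapreduce.Mapper, and CleanupSemantics, SimpleMapper, LegacyMapper and runAndRecord are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: simplified stand-in types, not the Hadoop Mapper API.
class CleanupSemantics {
    static class SimpleMapper {
        final List<String> events = new ArrayList<>();

        void setup() { events.add("setup"); }

        // Fails on the record "bad" to simulate an error during map.
        void map(String record) {
            if (record.equals("bad")) {
                throw new IllegalStateException("cannot process " + record);
            }
            events.add("map:" + record);
        }

        void cleanup() { events.add("cleanup"); }

        // New behaviour: cleanup runs even when map throws.
        void run(List<String> records) {
            setup();
            try {
                for (String record : records) {
                    map(record);
                }
            } finally {
                cleanup();
            }
        }
    }

    // Overrides run to restore the old behaviour: cleanup is skipped on error.
    static class LegacyMapper extends SimpleMapper {
        @Override
        void run(List<String> records) {
            setup();
            for (String record : records) {
                map(record);
            }
            cleanup();
        }
    }

    // Runs the mapper, records a failure as "error", and returns the event log.
    static List<String> runAndRecord(SimpleMapper mapper, List<String> records) {
        try {
            mapper.run(records);
        } catch (RuntimeException e) {
            mapper.events.add("error");
        }
        return mapper.events;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "bad", "c");
        System.out.println("new: " + runAndRecord(new SimpleMapper(), records));
        System.out.println("old: " + runAndRecord(new LegacyMapper(), records));
    }
}
```

In the first run the event log contains "cleanup" before the error is reported; in the second it does not, which mirrors what an end user would get by overriding run.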

WARNING: No release note provided for this incompatible change.

Passes a cached class loader to the ResourceBundle creator to minimize counter name lookup time.

When FairScheduler is used with security configured, job initialization fails. The problem is that the threads in JobInitializer run as the RPC user instead of as the jobtracker. Pre-starting all the threads fixes this bug.

Backported new APIs to get a Job object to 1.2.0 from 2.0.0. Job API static methods Job.getInstance(), Job.getInstance(Configuration) and Job.getInstance(Configuration, jobName) are now available across both releases to avoid porting pain.

A map task’s syslog now carries basic info on the InputSplit it processed.