Apache Hadoop 0.20.0 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

Added finalizeJob and terminateJob methods to the JobTrackerInstrumentation class.

Added synchronization for JobTracker methods in RecoveryManager.

Disabled Chukwa unit tests for 0.20 branch only.

Made TestJobHistory and its dependent test cases independent of RESTART_COUNT.

Reformatted HTML documentation for Hadoop to use submenus in the left column.

Changed RPM install location to the value specified by build.properties file.

Changed trash facility to use absolute path of the deleted file.
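
As a rough illustration of what using the absolute path means, the sketch below appends the full path of the deleted file under a per-user trash root. The `.Trash/Current` directory names, the home directory, and the helper name `trash_location` are assumptions made for this example and are not stated in these notes.

```python
import posixpath


def trash_location(deleted_path, home_dir):
    """Return where a deleted file would land under the user's trash root."""
    if not posixpath.isabs(deleted_path):
        raise ValueError("expected an absolute path")
    # Keep the whole absolute path (minus its leading slash) below the trash
    # root, so files with the same name in different directories do not clash.
    return posixpath.join(home_dir, ".Trash", "Current", deleted_path.lstrip("/"))


print(trash_location("/data/logs/part-00000", "/user/alice"))
```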

Changed fair scheduler UI to display minMaps and minReduces variables.

Introduced Chukwa collection of job history.

Improved framework for data aggregation in Chukwa.

Introduced new dfsadmin command saveNamespace to command the name service to do an immediate save of the file system image.

Changed fair scheduler to divide resources equally between pools, not jobs.
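
The difference between dividing by pool and dividing by job can be seen in a small sketch. This is a simplified illustration, not the scheduler's actual algorithm: it ignores minimum shares and weights, and the equal split among jobs inside a pool is an assumption. The function name and the pool and job names are hypothetical.

```python
def pool_fair_shares(total_slots, pools):
    """Split slots equally between non-empty pools, then between each pool's jobs.

    pools maps a pool name to the list of job names running in it.
    Returns a dict mapping each job name to its share of slots.
    """
    active = {name: jobs for name, jobs in pools.items() if jobs}
    shares = {}
    if not active:
        return shares
    per_pool = total_slots / len(active)
    for jobs in active.values():
        for job in jobs:
            # Assumption: jobs inside one pool share that pool's slots equally.
            shares[job] = per_pool / len(jobs)
    return shares


shares = pool_fair_shares(12, {"research": ["j1", "j2", "j3"], "prod": ["j4"]})
print(shares)
```

With a per-job split each of the four jobs would get 3 slots; with the per-pool split sketched here, the single job in `prod` gets 6 and each `research` job gets 2.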

Changed history directory permissions to 750 and history file permissions to 740.
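
For readers less familiar with octal modes, this short sketch prints the symbolic form of the two values using Python's standard `stat` module. The constant and function names are illustrative only.

```python
import stat

HISTORY_DIR_MODE = 0o750   # owner: rwx, group: r-x, others: none
HISTORY_FILE_MODE = 0o740  # owner: rwx, group: r--, others: none


def describe(mode, is_dir):
    """Render an octal permission mode the way `ls -l` would show it."""
    kind = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(kind | mode)


print(describe(HISTORY_DIR_MODE, True))    # drwxr-x---
print(describe(HISTORY_FILE_MODE, False))  # -rwxr-----
```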

Added a new counter REDUCE_INPUT_BYTES.

Introduced distch tool for parallel ch{mod, own, grp}.

Split hadoop-default.xml into core-default.xml, hdfs-default.xml and mapred-default.xml.

Moved HTTP server from FSNamesystem to NameNode. Removed FSNamesystem.getNameNodeInfoPort(). Replaced FSNamesystem.getDFSNameNodeMachine() and FSNamesystem.getDFSNameNodePort() with new method FSNamesystem.getDFSNameNodeAddress(). Removed constructor NameNode(bindAddress, conf).

Changed capacity scheduler UI to better present number of running and pending tasks.

Introduced independent HSFTP proxy server for authenticated access to clusters.

Moved org.apache.hadoop.hdfs.{CreateEditsLog, NNThroughputBenchmark} to org.apache.hadoop.hdfs.server.namenode.

Changed GetFileBlockLocations to return topology information for nodes that host the block replicas.

Improved MultiFileInputFormat so that multiple blocks from the same node or same rack can be combined into a single split.

Changed processing of conf/slaves file to allow # to begin a comment.
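
A minimal sketch of this kind of comment handling is shown below. It assumes that everything from `#` to the end of a line is ignored and that blank lines are skipped; the function name and host names are hypothetical, and the real scripts may differ in detail.

```python
def parse_slaves(text):
    """Return the host names listed in a slaves-style file, ignoring comments."""
    hosts = []
    for line in text.splitlines():
        # Assumption: '#' starts a comment wherever it appears on the line.
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


sample = """# worker nodes
node1.example.com
node2.example.com  # rack 2

#node3.example.com
"""
print(parse_slaves(sample))
```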

Changed JobTracker UI to better present the number of active tasks.

Changed JobTracker web status page to display the amount of heap memory in use. This changes the JobSubmissionProtocol.

Modified Hadoop file system to no longer create S3 buckets. Applications can create buckets for their S3 file systems by other means, for example, using the JetS3t API.

Added a shutdown hook that calls syncLogs so that the logs of the current task are flushed and log.index is up to date when the task exits through System.exit() or is killed by a signal other than SIGKILL. Changed writeToIndexFile() to write to a temporary index file first and then rename it to log.index, so that updates to log.index are atomic.
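
The write-then-rename pattern used for log.index can be sketched generically as follows. This is not the Hadoop implementation; the function name and the temporary file naming are assumptions chosen for the example, and it relies on the rename being atomic on the local file system.

```python
import os
import tempfile


def write_index_atomically(index_path, contents):
    """Replace index_path with contents so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(index_path))
    # Create the temporary file in the same directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="log.index.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(contents)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename either fully succeeds or leaves the old file in place.
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens the index at any moment sees either the previous complete version or the new complete version, never a half-written one.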

Improved TaskTracker blacklisting strategy to better exclude faulty trackers from executing tasks.

Introduced HttpServer method to support global filters.

Removed from class org.apache.hadoop.fs.RawLocalFileSystem deprecated methods public String getName(), public void lock(Path p, boolean shared) and public void release(Path p).

Changed KFS glue layer to allow applications to interface with multiple KFS metaservers.

Changed public class org.apache.hadoop.mapreduce.ID to be an abstract class. Removed from class org.apache.hadoop.mapreduce.ID the methods public static ID read(DataInput in) and public static ID forName(String str).

Removed Task’s dependency on concrete file systems by taking list from FileSystem class. Added statistics table to FileSystem class. Deprecated FileSystem method getStatistics(Class<? extends FileSystem> cls).

Introduced Vaidya rule-based performance diagnostic tool for Map/Reduce jobs.

Modified dfsadmin -report to report under-replicated blocks, blocks with corrupt replicas, and missing blocks.

Changed capacity scheduler policy to take note of task memory requirements and task tracker memory availability.

Added name node storage information to the dfshealth page, and moved data node information to a separate page.

Removed classes org.apache.hadoop.mapred.JobShell and org.apache.hadoop.mapred.TestJobShell. Removed from JobClient methods static void setCommandLineConfig(Configuration conf) and public static Configuration getCommandLineConfig().

Moved class org.apache.hadoop.mapred.StatusHttpServer to org.apache.hadoop.http.HttpServer.

Removed deprecated method parseArgs from org.apache.hadoop.fs.FileSystem.

Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
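
The new behaviour can be illustrated with a small stand-alone sketch in which the filter is applied only to paths that match the whole glob. The helper names are hypothetical, the matching is done per path component with Python's `fnmatch`, and applying the filter to complete matches only is an assumption about how the described result is obtained, not a description of the Hadoop code.

```python
from fnmatch import fnmatchcase


def glob_match(path, pattern):
    """Match a path against a glob one component at a time ('*' never crosses '/')."""
    path_parts = path.strip("/").split("/")
    glob_parts = pattern.strip("/").split("/")
    return len(path_parts) == len(glob_parts) and all(
        fnmatchcase(part, glob) for part, glob in zip(path_parts, glob_parts)
    )


def glob_status(paths, pattern, accept):
    """Return the paths that match the glob and are accepted by the filter."""
    return [p for p in paths if glob_match(p, pattern) and accept(p)]


paths = ["/a", "/a/b", "/a/c", "/x/y"]
only_a_b = lambda p: p == "/a/b"
print(glob_status(paths, "/*/*", only_a_b))  # ['/a/b']
```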

Changed names of ganglia metrics to avoid conflicts and to better identify source function.

Changed build procedure for libhdfs to build correctly for different platforms. Build instructions are in the Jira item.

Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.
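
To show why a Bloom filter over all keys is useful, the sketch below pairs a tiny Bloom filter with an in-memory map standing in for the keyed file. It is a generic illustration rather than the BloomMapFile implementation: the class names, the bit-array size, the number of hash functions, and the SHA-256 based hashing are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class BloomIndexedMap:
    """Hypothetical stand-in for a keyed file with a Bloom filter over its keys."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.filter = BloomFilter()
        for key in self.entries:
            self.filter.add(key)

    def get(self, key):
        if not self.filter.might_contain(key):
            return None  # definitely absent: skip the expensive lookup
        return self.entries.get(key)
```

The benefit is that a lookup for a key that was never written can usually be rejected from the filter alone, without searching the underlying data.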

Upgraded all core servers to use Jetty 6.

Replaced parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes.