Apache Hadoop 0.20.0 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

Changed KFS glue layer to allow applications to interface with multiple KFS metaservers.

Changed public class org.apache.hadoop.mapreduce.ID to be an abstract class. Removed from class org.apache.hadoop.mapreduce.ID the methods public static ID read(DataInput in) and public static ID forName(String str).

Removed from class org.apache.hadoop.fs.RawLocalFileSystem deprecated methods public String getName(), public void lock(Path p, boolean shared) and public void release(Path p).

Introduced HttpServer method to support global filters.

Changed processing of conf/slaves file to allow # to begin a comment.
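The comment handling can be illustrated with a minimal sketch. It is not the Hadoop implementation: the class and method names are hypothetical, and it assumes that everything from # to the end of a line is ignored and that blank lines are skipped.

```java
import java.util.ArrayList;
import java.util.List;

class SlavesFileSketch {
    // Returns host names from slaves-file lines, dropping '#' comments and blank lines.
    // Assumption: a '#' anywhere on a line starts a comment that runs to the end of that line.
    static List<String> parseHosts(List<String> lines) {
        List<String> hosts = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!body.isEmpty()) {
                hosts.add(body);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "# worker nodes",
            "node1.example.com",
            "node2.example.com  # second rack",
            "");
        System.out.println(parseHosts(lines)); // prints [node1.example.com, node2.example.com]
    }
}
```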

Moved org.apache.hadoop.hdfs.{CreateEditsLog, NNThroughputBenchmark} to org.apache.hadoop.hdfs.server.namenode.

Introduced independent HSFTP proxy server for authenticated access to clusters.

Moved HTTP server from FSNamesystem to NameNode. Removed FSNamesystem.getNameNodeInfoPort(). Replaced FSNamesystem.getDFSNameNodeMachine() and FSNamesystem.getDFSNameNodePort() with new method FSNamesystem.getDFSNameNodeAddress(). Removed constructor NameNode(bindAddress, conf).

Changed GetFileBlockLocations to return topology information for nodes that host the block replicas.

Changed JobTracker web status page to display the amount of heap memory in use. This changes the JobSubmissionProtocol.

Moved class org.apache.hadoop.mapred.StatusHttpServer to org.apache.hadoop.http.HttpServer.

Removed Task’s dependency on concrete file systems by taking list from FileSystem class. Added statistics table to FileSystem class. Deprecated FileSystem method getStatistics(Class<? extends FileSystem> cls).

Introduced distch tool for parallel ch{mod, own, grp}.

Upgraded all core servers to use Jetty 6.

Removed classes org.apache.hadoop.mapred.JobShell and org.apache.hadoop.mapred.TestJobShell. Removed from JobClient methods static void setCommandLineConfig(Configuration conf) and public static Configuration getCommandLineConfig().

Modified Hadoop file system to no longer create S3 buckets. Applications can create buckets for their S3 file systems by other means, for example, using the JetS3t API.

Changed names of ganglia metrics to avoid conflicts and to better identify source function.

Changed capacity scheduler policy to take note of task memory requirements and task tracker memory availability.

Removed deprecated method parseArgs from org.apache.hadoop.fs.FileSystem.

Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match.
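The /*/* example can be reproduced with a small standalone sketch. It does not use the Hadoop FileSystem API: the class, the method globWithFilter, and the in-memory path list are hypothetical, and the glob support is limited to *. It assumes the new behavior amounts to applying the filter only to paths that match the complete glob.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class GlobFilterSketch {
    // Converts a glob made of literals and '*' into a regex; '*' never crosses a '/'.
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append("[^/]*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Applies the filter only to paths that match the whole glob,
    // so intermediate components such as /a are never shown to the filter.
    static List<String> globWithFilter(List<String> paths, String glob, Predicate<String> filter) {
        Pattern pattern = globToRegex(glob);
        List<String> matches = new ArrayList<>();
        for (String path : paths) {
            if (pattern.matcher(path).matches() && filter.test(path)) {
                matches.add(path);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/a", "/a/b", "/a/c", "/x/y");
        System.out.println(globWithFilter(paths, "/*/*", "/a/b"::equals)); // prints [/a/b]
    }
}
```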

Changed capacity scheduler UI to better present number of running and pending tasks.

Improved TaskTracker blacklisting strategy to better exclude faulty trackers from executing tasks.

Changed JobTracker UI to better present the number of active tasks.

Introduced Vaidya rule based performance diagnostic tool for Map/Reduce jobs.

Added name node storage information to the dfshealth page, and moved data node information to a separate page.

Added a new counter REDUCE_INPUT_BYTES.

Introduced new dfsadmin command saveNamespace to command the name service to do an immediate save of the file system image.

Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys.
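For readers unfamiliar with Bloom filters, the following is a generic sketch of the data structure, not the filter that BloomMapFile uses. The class name, the sizing parameters, and the double-hashing scheme are all assumptions chosen for brevity. The point is that a negative answer is definite, so a lookup for an absent key can usually be rejected without reading the underlying file.

```java
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // Records a key by setting all of its bit positions.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false means the key was definitely never added; true means it may have been.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-0001");
        System.out.println(filter.mightContain("row-0001")); // prints true
    }
}
```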

Replaced parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes.

Split hadoop-default.xml into core-default.xml, hdfs-default.xml and mapreduce-default.xml.

Changed build procedure for libhdfs to build correctly for different platforms. Build instructions are in the Jira item.

Improved framework for data aggregation in Chukwa.

Changed fair scheduler to divide resources equally between pools, not jobs.
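The effect of dividing between pools rather than jobs can be shown with simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that each pool's share is then split equally among that pool's jobs. It ignores weights, minimum shares, and every other scheduler detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PoolShareSketch {
    // Splits totalSlots equally across pools, then equally across each pool's jobs.
    // Returns the per-job share for each pool; every pool is assumed to have at least one job.
    static Map<String, Double> perJobShare(int totalSlots, Map<String, Integer> jobsPerPool) {
        Map<String, Double> shares = new LinkedHashMap<>();
        double perPool = (double) totalSlots / jobsPerPool.size();
        for (Map.Entry<String, Integer> entry : jobsPerPool.entrySet()) {
            shares.put(entry.getKey(), perPool / entry.getValue());
        }
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Integer> jobsPerPool = new LinkedHashMap<>();
        jobsPerPool.put("A", 1); // pool A runs one job
        jobsPerPool.put("B", 4); // pool B runs four jobs
        // A per-job split would give each of the five jobs 20 slots.
        System.out.println(perJobShare(100, jobsPerPool)); // prints {A=50.0, B=12.5}
    }
}
```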

Introduced Chukwa collection of job history.

Changed RPM install location to the value specified by build.properties file.

Changed trash facility to use absolute path of the deleted file.

Improved MultiFileInputFormat so that multiple blocks from the same node or same rack can be combined into a single split.
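The idea of combining co-located blocks can be sketched as a grouping step. This is a simplified illustration rather than the input format's actual logic: the names are hypothetical, each block is assumed to live on exactly one host, and rack-level grouping and split size limits are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombineByNodeSketch {
    // Groups block names by the host that stores them; each group stands for one combined split.
    // Each input row is a {blockName, hostName} pair.
    static Map<String, List<String>> groupByHost(String[][] blockAndHost) {
        Map<String, List<String>> splits = new LinkedHashMap<>();
        for (String[] pair : blockAndHost) {
            splits.computeIfAbsent(pair[1], host -> new ArrayList<>()).add(pair[0]);
        }
        return splits;
    }

    public static void main(String[] args) {
        String[][] blocks = {
            {"blk_1", "node1"},
            {"blk_2", "node2"},
            {"blk_3", "node1"},
        };
        System.out.println(groupByHost(blocks)); // prints {node1=[blk_1, blk_3], node2=[blk_2]}
    }
}
```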

Changed fair scheduler UI to display minMaps and minReduces variables.

Modified dfsadmin -report to report under-replicated blocks, blocks with corrupt replicas, and missing blocks.

Changed history directory permissions to 750 and history file permissions to 740.
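As a reading aid, the octal modes can be expanded into rwx notation. The helper below is a hypothetical illustration and does not touch any Hadoop code. It shows that 750 gives the owner full access and the group read and execute access, while 740 gives the group read access only, and neither mode grants anything to other users.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

class HistoryPermissionsSketch {
    // Converts a three-digit octal mode such as "750" into nine-character rwx notation.
    static String toSymbolic(String octal) {
        StringBuilder sb = new StringBuilder();
        for (char c : octal.toCharArray()) {
            int digit = c - '0';
            sb.append((digit & 4) != 0 ? 'r' : '-');
            sb.append((digit & 2) != 0 ? 'w' : '-');
            sb.append((digit & 1) != 0 ? 'x' : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic("750")); // prints rwxr-x---
        System.out.println(toSymbolic("740")); // prints rwxr-----
        // The same string can be parsed into a permission set by the standard library.
        Set<PosixFilePermission> dirPerms = PosixFilePermissions.fromString(toSymbolic("750"));
        System.out.println(dirPerms.contains(PosixFilePermission.OTHERS_READ)); // prints false
    }
}
```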

Made TestJobHistory and its dependent test cases independent of RESTART_COUNT.

Reformatted HTML documentation for Hadoop to use submenus at the left column.

Disabled Chukwa unit tests for 0.20 branch only.

Added finalizeJob and terminateJob methods to the JobTrackerInstrumentation class.

Added a shutdown hook that calls syncLogs so that the logs of the current task are flushed and log.index is up to date when the task exits through System.exit() or is killed by a signal other than SIGKILL. Changed writeToIndexFile() to write to a temporary index file first and then rename it to log.index, so that updates to log.index are atomic.
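The write-then-rename pattern and the shutdown hook can be sketched with the Java standard library. This is not the task log code from the patch: the class, the method names, the .tmp suffix, and the file contents are assumptions made for illustration, and only the file name log.index comes from the note above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicIndexSketch {
    // Writes content to a sibling temporary file, then renames it over the target,
    // so a reader sees either the old index or the new one, never a partial write.
    // Whether ATOMIC_MOVE replaces an existing target is platform-specific; it does on POSIX systems.
    static void writeAtomically(Path target, String content) {
        try {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the index twice in a fresh temporary directory and returns what is left on disk.
    static String demo() {
        try {
            Path index = Files.createTempDirectory("tasklog").resolve("log.index");
            writeAtomically(index, "first");
            writeAtomically(index, "second");
            return new String(Files.readAllBytes(index), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempDirectory("tasklog").resolve("log.index");
        // The hook runs on normal exit, System.exit(), and most signals, but not on SIGKILL.
        Runtime.getRuntime().addShutdownHook(
            new Thread(() -> writeAtomically(index, "synced at shutdown")));
        writeAtomically(index, "in progress");
        System.out.println(demo());
    }
}
```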

Added synchronization for JobTracker methods in RecoveryManager.