Apache Hadoop 0.20.1 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

This patch resets the variable totalBytesProcessed before the final merge so that it is used correctly when calculating the progress of reducePhase (the third phase of a reduce task).

Removed pre-emption from capacity scheduler. The impact of this change is that capacities for queues can no longer be guaranteed within a given span of time. Also changed configuration variables to remove pre-emption related variables and better reflect the absence of guarantees.

WARNING: No release note provided for this change.

After HADOOP-4372, empty job history files caused a NullPointerException (NPE). This issue fixes that by creating new files if no old file is found.

If the child (streaming) process returned successfully but the MROutputThread threw an error, there was no way to detect it, because all IOExceptions were ignored. Such failures can occur when, for example, DFS clients have been closed. Now a check for errors in the threads is made before the task finishes, and an exception is thrown that fails the task.
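
The check described above can be illustrated with a minimal, self-contained sketch. It is not the streaming code itself: the class name OutputThreadCheck and the method runAndCheck are hypothetical, and the sketch only assumes that the worker thread records its failure so the caller can inspect it before declaring success.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; these names do not come from Hadoop.
final class OutputThreadCheck {

    // Runs the output work on a separate thread, waits for it to finish,
    // and fails the caller if the thread recorded any error.
    static void runAndCheck(Runnable outputWork)
            throws IOException, InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread outputThread = new Thread(() -> {
            try {
                outputWork.run();
            } catch (Throwable t) {
                // Record the error instead of silently dropping it.
                failure.set(t);
            }
        });

        outputThread.start();
        outputThread.join();

        // The check made before finishing: surface any recorded error.
        Throwable recorded = failure.get();
        if (recorded != null) {
            throw new IOException("output thread failed", recorded);
        }
    }
}
```

In this sketch the caller sees an IOException whose cause is the original error, so a failure in the output thread can no longer pass as a successful task.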

Fixes the capacity scheduler to charge more of a queue's capacity to a high-memory job. Such jobs are treated as occupying proportionally more slots, relative to a slot’s default memory size.
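
As a rough illustration of the proportional accounting, the sketch below computes how many slots a task would be charged for. The names SlotAccounting and slotsPerTask are hypothetical, and rounding up to a whole number of slots is an assumption of this sketch rather than something the note states.

```java
// Hypothetical illustration; not the capacity scheduler's actual code.
final class SlotAccounting {

    // Number of slots charged for one task, given the task's memory
    // requirement and a slot's default memory size (both in MB).
    // Assumption: a partial slot is rounded up to a full slot.
    static int slotsPerTask(long taskMemoryMb, long slotMemoryMb) {
        if (taskMemoryMb <= 0 || slotMemoryMb <= 0) {
            throw new IllegalArgumentException("memory sizes must be positive");
        }
        return (int) ((taskMemoryMb + slotMemoryMb - 1) / slotMemoryMb);
    }
}
```

Under these assumptions, a task that needs 2048 MB on a cluster with 1024 MB slots counts as two slots against its queue.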

The JobTracker crashed if it failed to create the jobtracker.info file (i.e., if sufficient datanodes were not up). With this patch it keeps retrying on IOExceptions, on the assumption that an IOException during jobtracker.info creation implies that HDFS is not in a *ready* state.
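
The retry behaviour can be sketched generically as follows. This is not the JobTracker code: RetryOnIOException, IOAction, and the parameters are hypothetical, and the sketch bounds the number of attempts so that it terminates, whereas the note describes retrying until the file is created.

```java
import java.io.IOException;

// Hypothetical illustration; these names do not come from Hadoop.
final class RetryOnIOException {

    // An action that may fail with an IOException, such as creating a file.
    interface IOAction<T> {
        T run() throws IOException;
    }

    // Retries the action whenever it throws IOException, sleeping between
    // attempts. Assumption: attempts are capped at maxAttempts.
    static <T> T retry(IOAction<T> action, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                // Treat the failure as "file system not ready yet".
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```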

TestJobHistory fails as jobtracker is restarted very fast (within a minute) and history files from earlier testcases were not cleaned up. This patch cleans up the history-dir and mapred-system-dir after every test.

Adds a new binary file format, TFile.

KeyFieldBasedPartitioner throws ArrayOutOfIndex when passed an empty key. This patch hashes an empty key to hash code 0.
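
The effect of the fix can be sketched with a simplified hash function. This is an illustration only: EmptyKeyPartitioning and its methods are hypothetical, and the byte-wise hash and the partition formula are assumptions rather than the partitioner's actual implementation.

```java
// Hypothetical illustration; not KeyFieldBasedPartitioner itself.
final class EmptyKeyPartitioning {

    // Hashes the key bytes; an empty (or null) key is explicitly mapped
    // to hash code 0 instead of being indexed into.
    static int hash(byte[] key) {
        if (key == null || key.length == 0) {
            return 0;
        }
        int h = 0;
        for (byte b : key) {
            h = 31 * h + b;
        }
        return h;
    }

    // Maps a key to one of numPartitions partitions (assumed formula).
    static int partition(byte[] key, int numPartitions) {
        return (hash(key) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With this guard, every empty key lands in partition 0 rather than causing an exception.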

When a job is initialized, it localizes the job conf to the logs dir. Without this patch, the local copy never gets deleted. Now the conf is deleted when the job retires. The local copy is required so that the conf can be displayed on the web UI.

CompletedJobStatusStore was hardcoded to persist to HDFS. This patch allows it to persist to the local file system: just qualify mapred.job.tracker.persist.jobstatus.dir with file://

Provide a new option to rm and rmr, -skipTrash, which will immediately delete the files specified, rather than moving them to the trash.

This patch adds the mapid and reduceid in the http header of mapoutput when being sent to reduce node. Also validates compressed length, decompressed length, mapid and reduceid from http header at reduce node.

Fixed a bug in Pipes combiner to reset the spilled bytes count after the spill.

Fixed backwards compatibility by re-introducing and deprecating removed memory monitoring related configuration options.

The multithreaded mapper was modified so that, when it encounters a fault, it creates a new RuntimeException from the Throwable instead of casting the Throwable to RuntimeException.
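
The difference between wrapping and casting can be shown in a few lines. The class ThrowableWrapping and the method asRuntime are hypothetical names used only for this sketch.

```java
// Hypothetical illustration; these names do not come from Hadoop.
final class ThrowableWrapping {

    // Wraps any Throwable in a new RuntimeException, keeping it as the cause.
    // A cast such as (RuntimeException) t would instead fail with a
    // ClassCastException whenever t is a checked exception or an Error.
    static RuntimeException asRuntime(Throwable t) {
        return new RuntimeException(t);
    }
}
```

Wrapping works for any Throwable, and the original fault stays available through getCause().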

Fixed a bug in the way commit of task outputs happens. The bug was that if commit fails with IOException, the task would be declared as successful.

The job initialization process was changed so that it no longer changes run-states during initialization. The reason is twofold: (1) changing state there can lead to deadlock, because state changes require circular locking (i.e., JobInProgress requires the JobTracker lock), and (2) events were not raised, because these state changes were not propagated back to the JobTracker.

Now the JobTracker takes care of initializing, failing, and killing the job and of raising the appropriate events. The rule now enforced is simple: the JobTracker lock *must* be held before changing the run-state of a job.

Reduced the frequency of log messages printed when a deprecated memory management variable is found in configuration of a job.

JobTracker was changed to take an identifier as an argument. This helps in testcases where the jobtracker/mapred-cluster is (re)started in a short span of time and the chances of the jobtracker identifier clashing are high. The RecoveryManager was also modified to throw an exception if a job fails in init during the recovery process. The exception triggers a job failure during recovery, which removes the failed job from further initialization and processing.

The tasktracker’s startup code was modified to use deprecated memory management configuration variables, when specified, and enable memory monitoring of tasks.

Fixed a bug in the new org.apache.hadoop.mapreduce.Counters.getGroup() method to return an empty group if group name doesn’t exist, instead of null, thus making sure that it is in sync with the Javadoc.
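
The "empty group instead of null" contract can be sketched with a small stand-in class. CounterGroups below is hypothetical and does not reproduce the Counters API; it only shows a lookup that never returns null.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; not org.apache.hadoop.mapreduce.Counters.
final class CounterGroups {

    private final Map<String, Map<String, Long>> groups = new HashMap<>();

    // Adds amount to the named counter, creating the group if needed.
    void increment(String group, String counter, long amount) {
        groups.computeIfAbsent(group, g -> new HashMap<>())
              .merge(counter, amount, Long::sum);
    }

    // Returns the group's counters, or an empty map if the group does not
    // exist, so callers never need a null check.
    Map<String, Long> getGroup(String name) {
        Map<String, Long> group = groups.get(name);
        return group == null
                ? Collections.<String, Long>emptyMap()
                : Collections.unmodifiableMap(group);
    }
}
```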

The JobTracker tries to delete the mapred.system.dir when it is starting up (with the job recovery disabled). The fix provided by this jira is that JobTracker will fail (bail out) with AccessControlException if it fails to delete files/directories in mapred.system.dir due to access control issues.

Removes the dependency of hadoop-mapred from commons-cli2 and uses commons-cli1.2 for command-line parsing.

GenericOptionsParser in branch 0.20 depends on commons-cli2. This jira removes the dependency of branch 0.20 on commons-cli2 completely. The problem is seen after ‘ant binary’ where all the library files are copied to ‘$hadoop-home/lib’ which already has commons-cli2.

Various code paths in the framework caught Throwable and tried to do inline cleanup. In the case of OOM errors, such inline cleanup can result in hung JVMs. With this fix, the TaskTracker provides an API to report fatal errors (any Throwable other than FSError and Exceptions). On catching a Throwable, the Mapper/Reducer tries to inform the TT.
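
The classification rule in the parentheses above (fatal means any Throwable other than FSError and Exceptions) can be sketched as a simple predicate. FatalErrorCheck and isFatal are hypothetical names, and FsErrorStandIn is a local stand-in for Hadoop's FSError so that the sketch compiles without Hadoop on the classpath.

```java
// Hypothetical illustration; not the TaskTracker's reporting API.
final class FatalErrorCheck {

    // Stand-in for Hadoop's FSError (an assumption made for this sketch).
    static final class FsErrorStandIn extends Error {
        FsErrorStandIn(String message) {
            super(message);
        }
    }

    // A throwable is treated as fatal if it is neither an Exception
    // nor the file-system error type.
    static boolean isFatal(Throwable t) {
        return !(t instanceof Exception) && !(t instanceof FsErrorStandIn);
    }
}
```

Under this rule an OutOfMemoryError is fatal, while a RuntimeException or the file-system error follows the ordinary failure path.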