Apache Hadoop 0.21.0 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.

Removed deprecated methods DFSClient.getHints() and DFSClient.isDirectory().

Removed deprecated FileSystem methods getBlockSize(Path f), getLength(Path f), and getReplication(Path src).

Fsck now checks permissions as directories are traversed. Any user can now use fsck, but information is provided only for directories the user has permission to read.

Removed obsolete, deprecated subclasses of ChecksumFileSystem (InMemoryFileSystem, ChecksumDistributedFileSystem).

Removed deprecated method FileSystem.delete(Path).

UNIX-style sticky bit implemented for HDFS directories. When the sticky bit is set on a directory, files in that directory may be deleted or renamed only by a superuser or the file’s owner.

New logcondense option retain-master-logs indicates whether the script should delete master logs as part of its cleanup process. By default this option is false; master logs are deleted. Earlier versions of logcondense did not delete master logs.

New filesystem shell command -df reports capacity, space used and space free. Any user may execute this command without special privileges.

Changed df dfsadmin -report to list live and dead nodes, and attempt to resolve the hostname of datanode ip addresses.

Backup namenode’s web UI default page now has some useful content.

WARNING: No release note provided for this change.

Chukwa supports pipelined writers for improved extensibility.

Removed deprecated methods getName() and getNamed(String, Configuration) from FileSystem and descendant classes.

Removed deprecated FileSystem methods .

Streaming allows binary (or other non-UTF8) streams.

Fixed a synchronization bug in job history content parsing that could result in garbled history data or a ConcurrentModificationException.

Patch introduces new configuration switch dfs.name.dir.restore (boolean) enabling this functionality. Documentation needs to be updated.

UPDATE: Config key is now “dfs.namenode.name.dir.restore” for 1.x and 2.x+ versions of HDFS

Include IO offset to client trace logging output.

New example BaileyBorweinPlouffe computes digits of pi. (World record!)

All output part files are created regardless of whether the corresponding task has output.

New configuration parameter io.seqfile.local.dir for use by SequenceFile replaces mapred.local.dir.

Chukwwa Log4J appender options allow a retention policy to limit number of files.

New DFSAdmin command -restoreFailedStorage true|false|check sets policy for restoring failed fsimage/editslog volumes.

New dfsAdmin command -printTopology shows topology as understood by the namenode.

New HDFS tool JMXGet facilitates command line access to statistics via JMX.

Introduced backup node which maintains the up-to-date state of the namespace by receiving edits from the namenode, and checkpoint node, which creates checkpoints of the name space. These facilities replace the secondary namenode.

Streaming option -combiner allows any streaming command (not just Java class) to be a combiner.

Every invocation of FileSystem.newInstance() returns a newly allocated FileSystem object. This may be an incompatible change for applications that relied on FileSystem object identity.

Accessing HDFS with any ip, hostname, or proxy should work as long as it points to the interface NameNode is listening on.

New HDFS proxy server (Tomcat based) allows clients controlled access to clusters with different versions. See Hadoop-5366 for information on using curl and wget.

Zero values for dfs.socket.timeout and dfs.datanode.socket.write.timeout are now respected. Previously zero values for these parameters resulted in a 5 second timeout.

Removed deprecated NetUtils.getServerAddress.

New BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the bytes array corresponding to the key.

New contribution MRUnit helps authors of map-reduce programs write unit tests with JUnit.

New plugin facility for namenode and datanode instantiates classes named in new configuration properties dfs.datanode.plugins and dfs.namenode.plugins.

New server web page …/metrics allows convenient access to metrics data via JSON and text.

New Fair Scheduler configuration parameter webinterface.private.actions controls whether changes to pools and priorities are permitted from the web interface. Changes are not permitted by default.

Job Tracker queue ACLs can be changed without restarting Job Tracker.

New Offline Image Viewer (oiv) tool reads an fsimage file and writes the data in a variety of user-friendly formats, including XML.

Additional examples and documentation for HDFS Offline Image Viewer Tool show how to generate Pig-friendly data and to do analysis with Pig.

Updates streaming documentation to correct the name used for the GZipCodec.

WARNING: No release note provided for this change.

New Fair Scheduler configuration parameter sets a default limit on number of running jobs for all pools.

WARNING: No release note provided for this change.

New mradmin command -refreshNodes updates the job tracker’s node lists.

Added unit tests for verifying LinuxTaskController functionality.

Distcp will no longer start jobs that move no data.

Fixed JobTracker to use it’s own credentials instead of the job’s credentials for accessing mapred.system.dir. Also added APIs in the JobTracker to get the FileSystem objects as per the JobTracker’s configuration.

Introduced access tokens as capabilities for accessing datanodes. This change to internal protocols does not affect client applications.

Fixed error parsing job history counters after change of counter format.

New configuration parameter fs.automatic.close can be set false to disable the JVM shutdown hook that automatically closes FileSystems.

WARNING: No release note provided for this change.

New contribution Sqoop is a JDBC-based database import tool for Hadoop.

Output of hadoop fs -dus changed to be consistent with hadoop fs -du and with Linux du. Users who previously parsed this output should update their scripts. New feature hadoop fs -du -h may be used for human readable output.

Jars passed to the -libjars option of hadoop jars are no longer unpacked inside mapred.local.dir.

New DistCp option -pt preserves last modification and last access times of copied files.

Introduced a configuration parameter, mapred.heartbeats.in.second, as an expert option, that defines how many heartbeats a jobtracker can process in a second. Administrators can set this to an appropriate value based on cluster size and expected processing time on the jobtracker to achieve a balance between jobtracker scalability and latency of jobs.

Files stored on the native S3 filesystem (s3n:// URIs) now report a block size determined by the fs.s3n.block.size property (default 64MB).

New contribution Dynamic Scheduler implements dynamic priorities with a currency model. Usage instructions are in the Jira item.

Fixed the build to make sure that all the unit tests in contrib are run, regardless of the success/failure status of the previous projects’ tests.

Fixed a bug in IsolationRunner to make it work for map tasks.

New mradmin command -refreshQueues reads new configuration of ACLs and queue states from mapred-queues.xml. If the new queue state is not “running,” jobs in progress will continue, but no other jobs from that queue will be started.

New Sqoop argument –hive-import facilitates loading data into Hive.

WARNING: No release note provided for this change.

Modifies AggregateWordCount and AggregateWordHistogram examples to use the new Map/Reduce API

Added Configuration property “mapred.committer.job.setup.cleanup.needed” to specify whether job-setup and job-cleanup is needed for the job output committer. The default value is true. Added Job.setJobSetupCleanupNeeded and JobContext.getJobSetupCleanupNeeded api for setting/getting the configuration. If the configuration is set to false, no setup or cleanup will be done.

If the number of jobs per user exceeded mapred.jobtracker.completeuserjobs.maximum then the job was flushed out of the jobtracker’s memory after the job finishes min-time (hardcoded to 1 min). This caused jobclient’s fail with NPE. In this patch the min-time to retain a job is made configurable (mapred.jobtracker.retirejob.interval.min).

Added support for preemption in the fair scheduler. The new configuration options for enabling this are described in the fair scheduler documentation.

Once the job is done, the history file and associated conf file is moved to history.folder/done folder. This is done to avoid garbling the running jobs’ folder and the framework no longer gets affected with the files in the done folder. This helps in 2 was 1) ls on running folder (recovery) is faster with less files 2) changes in running folder results into FileNotFoundException.

So with existing code, the best way to keep the running folder clean is to note the id’s of running job and then move files that are not in this list to the done folder. Note that on an avg there will be 2 files in the history folder namely 1) job history file 2) conf file.

With restart, there might be more than 2 files, mostly the extra conf files. In such a case keep the oldest conf file (based on timestamp) and delete the rest. Note that this its better to do this when the jobtracker is down.

Patch increases the replication factor of _distcp_src_files to sqrt(min(maxMapsOnCluster, totalMapsInThisJob)) sothat many maps won’t access the same replica of the file _distcp_src_files at the same time.

Provides ability to run a health check script on the tasktracker nodes and blacklist nodes if they are unhealthy.

DistCp now has a “-basedir” option that allows you to set the sufix of the source path that will be copied to the destination.

Consolidate the Mock Objects used for testing in a separate class(FakeObjectUtiltities) to ease re-usability

Modifies TestTaskLimits to do unit testing instead of running jobs using MR clusters

Provided ability in the capacity scheduler to limit the number of slots that can be concurrently used per queue at any given time.

Modifies TestRackAwareTaskPlacement to not use MiniMR/DFS Cluster for testing, thereby making it a unit test

TestJobTrackerRestart failed because of stale filemanager cache (which was created once per jvm). This patch makes sure that the filemanager is inited upon every JobHistory.init() and hence upon every restart. Note that this wont happen in production as upon a restart the new jobtracker will start in a new jvm and hence a new cache will be created.

hadoop vaidya counter names LOCAL_BYTES_READ and LOCAL_BYTES_WRITTEN are changed to respectively FILE_BYTES_READ, FILE_BYTES_WRITTEN as per current hadoop counter names.

New Hadoop script command classpath prints the path to the Hadoop jar and libraries.

Ports KeyFieldBasedComparator and KeyFieldBasedPartitioner to the new Map/Reduce API

Changed log level of addition of blacklisted reason in the JobTracker log to debug instead of INFO

Ports KeyValueLineRecordReader and KeyValueTextInputFormat the new Map/Reduce API

Only one MR cluster is brought up and hence there is no scope of jobid clashing.

Modifies TestCommandLineJobSubmission to add a test for testing custom output committer and removes TestCustomOutputCommitter

Provide ability to collect statistics about tasks completed and succeeded for each tracker in time windows. The statistics is available on the jobtrackers’ nodes UI page.

TestNodeRefresh sometimes timed out. This happened because the test started a MR cluster with 2 trackers and ran a half-waiting-mapper job. Tasks that have id > total-maps/2 wait for a signal. Because of 2 trackers, the tasks got scheduled out of order (locality) and hence the job got stuck. The fix is to start only one tracker and then add a new tracker later.

Modifies TestTrackerBlacklistAcrossJobs to use mock objects for testing instead of running a full-fledged job using MiniMR clusters.

Modifies TestKillCompletedJob to rid of its dependence on MiniMR clusters and makes it a unit test

Modifies TestLostTracker to use Mock objects instead of running full-fledged jobs using the MiniMR clusters.

Expert level config properties mapred.shuffle.connect.timeout and mapred.shuffle.read.timeout that are to be used at cluster level are added by this patch.

Allow creating archives with relative paths with a -p option on the command line.

Log a job-summary at the end of a job, while allowing it to be configured to use a custom appender if desired.

WARNING: No release note provided for this change.

Added following APIs to Configuration: - public <T extends Enum<T>> T getEnum(String name, T defaultValue) - public <T extends Enum<T>> void setEnum(String name, T value)

WARNING: No release note provided for this change.

Fixes some edge cases while using speculative execution

Moves TestReduceFetchFromPartialMem out of TestReduceFetch into a separate test to enable it to be included in the commit-tests target.

Jobtracker was modified to cleanup reservations created on tasktracker nodes to support high RAM jobs, when the nodes are blacklisted.

New Avro serialization in …/io/serializer/avro.

Modifies TestUserDefinedCounters to use LocalJobRunner instead of using MiniMR cluster

Patch that ports MultipleInputs, DelegatingInputFormat, DelegatingMapper and TaggedInputSplit to the new Map/Reduce API

Ports FieldSelectionMapReduce to the new Map/Reduce API

Creates a new test to test several miscellaneous functionality at one shot instead of running a job for each, to be used as a fast test for the ant commit-tests target.

Fix job-summary logs to correctly record final status of FAILED and KILLED jobs.

Add Combiner support to MapReduceDriver in MRUnit

WARNING: No release note provided for this change.

TestNodeRefresh waits for the newly added tracker to join before starting the testing.

Enhanced -list-blacklisted-trackers to include the reason for blacklisting a node. Modified JobSubmissionProtocol’s version as the ClusterStatus is changed to have a new class. The format of the -list-blacklisted-trackers command line interface has also changed to show the reason.

Ports the SequenceFile* classes to the new Map/Reduce API

Added a new target ‘test-commit’ to the build.xml file which runs tests specified in the file src/test/commit-tests. The tests specified in src/test/commit-tests should provide maximum coverage and all the tests should run within 10mins.

Fixed a bug in the testcase TestKillSubProcesses.

Ports NLineInputFormat and MapFileOutputFormat to the new Map/Reduce API

Provides an ability to move completed job history files to a HDFS location via configuring “mapred.job.tracker.history.completed.location”. If the directory location does not already exist, it would be created by jobtracker.

Changes the heapsize for findbugs to a parameter which can be changed on the build command line.

Provides a way to configure the cache of JobStatus objects for the retired jobs. Adds an API in RunningJob to access history file url. Adds a LRU based cache for job history files loaded in memory when accessed via JobTracker web UI. Adds Retired Jobs table on the Jobtracker UI. The job move from Running to Completed/Failed table. Then job move to Retired table when it is purged from memory. The Retired table shows last 100 retired jobs. The Completed/Failed jobs table are only shown if there are non-zero jobs in the table.

MAPREDUCE-805 changed the way the job was initialized. Capacity schedulers testcases were not modified as part of MAPREDUCE-805. This patch fixes this bug.

Modified TaskTracker and related classes so that per-job local data on the TaskTracker node has right access-control. Important changes: - All files/directories of the job on the TaskTracker are now user-owned by the job-owner and group-owner by a special TaskTracker’s group. - The permissions of the file/directories are set to the most restrictive permissions possible. - Files/dirs shareable by all tasks of the job on this TT are set proper access control as soon as possible, i.e immediately after job-localization and those that are private to a single task are set access control after the corresponding task’s localization. - Also fixes MAPREDUCE-131 which is related to a bug because of which tasks hang when the taskcontroller.cfg has multiple entries for mapred.local.dir - A new configuration entry hadoop.log.dir corresponding to the hadoop.log.dir in TT’s configuration is now needed in task-controller.cfg so as to support restricted access control for userlogs of the tasks on the TaskTracker.

Support for FIFO pools added to the Fair Scheduler.

Changed the target name from “tools-jar” to “tools” in build.xml.

Datanode can continue if a volume for replica storage fails. Previously a datanode resigned if any volume failed.

Modifies LineRecordReader to report an approximate progress, instead of just returning 0, when using compressed streams.

Removed the Job Retire thread and the associated configuration parameters. Job is purged from memory as soon as the history file is copied to HDFS. Only JobStatus object is retained in the retired jobs cache.

Support new API in unit tests developed with MRUnit.

FileSystem listStatus method throws FileNotFoundException for all implementations. Application code should catch or propagate FileNotFoundException.

FileSystem.listStatus() previously returned null for empty or nonexistent directories; will now return empty array for empty directories and throw FileNotFoundException for non-existent directory. Client code should be updated for new semantics.

The semantics for dealing with non-existent paths passed to FileSystem::listStatus() were updated and solidified in HADOOP-6201 and HDFS-538. Existing code within MapReduce that relied on the previous behavior of some FileSystem implementations of returning null has been updated to catch or propagate a FileNotFoundException, per the method’s contract.

Allow logging level of map/reduce tasks to be configurable. Configuration changes: add mapred.map.child.log.level add mapred.reduce.child.log.level

Adds Reduce Attempt ID to ClientTrace log messages, and adds Reduce Attempt ID to HTTP query string sent to mapOutputServlet. Extracts partition number from attempt ID.

Ports the mapred.join library to the new Map/Reduce API

New Configuration.dumpConfiguration(Configuration, Writer) writes configuration parameters in the JSON format.

Add PipelineMapReduceDriver to MRUnit to support testing a pipeline of MapReduce passes

Extended DistributedCache to work with LocalJobRunner.

Provides an ability to dump jobtracker configuration in JSON format to standard output and exits. To dump, use hadoop jobtracker -dumpConfiguration The format of the dump is {“properties”:[{“key”:<key>,“value”:<value>,“isFinal”:<true/false>,“resource” : <resource>}] }

Modifies Gridmix2 to use the new Map/Reduce API

Support hierarchical queues in the CapacityScheduler to allow for more predictable sharing of cluster resources.

Fixed LinuxTaskController binary so that permissions of local files on TT are set correctly: user owned by the job-owner and group-owned by the group owner of the binary and _not_ the primary group of the TaskTracker.

New server web pages provide block information: corrupt_replicas_xml and block_info_xml.

Simplifies job recovery. On jobtracker restart, incomplete jobs are resubmitted and all tasks reexecute. This JIRA removes a public constructor in JobInProgress.

New LimitedByteArrayOutputStream does not expand buffer on writes.

Refactors shuffle code out of ReduceTask into separate classes in a new package(org.apache.hadoop.mapreduce.task.reduce) Incorporates MAPREDUCE-240, batches up several map output files from a TT to a reducer in a single transfer Introduces new Shuffle counters to keep track of shuffle errors

Ports MultipleOutputs to the new Map/Reduce API

TestNodeRefresh timed out as the code to do with node refresh got removed. This patch removes the testcase.

WARNING: No release note provided for this change.

Moved process tree, and memory calculator classes out of Common project into the Map/Reduce project.

Improved error message suggests using -skpTrash option when hdfs -rm fails to move to trash because of quota.

Modified TaskMemoryManager so that it logs a map/reduce task’s process-tree’s status just before it is killed when it grows out of its configured memory limits. The log dump is in the format " |- PID PPID PGRPID SESSID CMD_NAME VMEM_USAGE(BYTES) FULL_CMD_LINE".

This is useful for debugging the cause for a map/reduce task and it’s corresponding process-tree to be killed by the TaskMemoryManager.

New FileSystem method reports default parameters that would be used by server. See also HDFS-578.

New FileSystem.getServerDefaults() reports the server’s default file creation parameters.

Unit tests updated to match syntax of new configuration parameters.

New configuration option dfs.umaskmode sets umask with octal or symbolic value.

HFTP can now serve a specific byte range from a file

Annotation mechanism enables interface classification.

Deprecate o.a.h.mapred.FileAlreadyExistsException and replace it with o.a.h.fs.FileAlreadyExistsException.

BZip2 files can now be split.

Splitting support for BZip2 Text data

WARNING: No release note provided for this change.

Provide an ability to configure the compression level and strategy for codecs. Compressors need to be ‘reinited’ with new characteristics such as compression level etc. and hence an incompatible addition to the api.

Fixed TaskTracker and related classes so as to set correct and most restrictive access control for DistributedCache files/archives. - To do this, it changed the directory structure of per-job local files on a TaskTracker to the following: $mapred.local.dir `– taskTracker `– $user |- distcache `– jobcache - Distributed cache files/archives are now user-owned by the job-owner and the group-owned by the special group-owner of the task-controller binary. The files/archives are set most private permissions possible, and as soon as possible, immediately after the files/dirs are first localized on the TT. - As depicted by the new directory structure, a directory corresponding to each user is created on each TT when that particular user’s first task are assigned to the corresponding TT. These user directories remain on the TT forever are not cleaned when unused, which is targeted to be fixed via MAPREDUCE-1019. - The distributed cache files are now accessible _only_ by the user who first localized them. Sharing of these files across users is no longer possible, but is targeted for future versions via MAPREDUCE-744.

New experimental API BlockPlacementPolicy allows investigating alternate rules for locating block replicas.

Enables tcp.nodelay for RPC between Child and TaskTracker.

New DFSClient.create(…) allows option of not creating missing parent(s).

Added support for hierarchical queues in the Map/Reduce framework with the following changes: - mapred-queues.xml is modified to a new XML template as mentioned in the JIRA. - Modified JobQueueInfo to contain a handle to child queues. - Added new APIs in the client to get ‘root’ queues, so that the entire hierarchy of queues can be iterated. -Added new APIs to get the child queues for a given queue .

New DFSClient.mkdir(…) allows option of not creating missing parent(s).

Changes the Job History file format to use JSON. Simplifies the Job History Parsing logic Removes duplication of code between HistoryViewer and the JSP files History Files are now named as JobID_user Introduces a new cluster level configuration “mapreduce.cluster.jobhistory.maxage” for configuring the amount of time history files are kept before getting cleaned up The configuration “hadoop.job.history.user.location” is no longer supported.

New FileContext API introduced to replace FileSystem API. FileContext will be the version-compatible API for future releases. FileSystem API will be deprecated in the next release.

Enhance the Context Objects api to add features to find and track jobs.

Adds an API Cluster#getJobHistoryUrl(JobID jobId) to get the history url for a given job id. The API does not check for the validity of job id or existence of the history file. It just constructs the url based on history folder, job id and the current user.

Minor changes to HadoopJob.java in eclipse-plugin contrib project to accommodate changes in JobStatus (MAPREDUCE-777)

Rename and categorize configuration keys into - cluster, jobtracker, tasktracker, job, client. Constants are defined for all keys in java and code is changed to use constants instead of direct strings. All old keys are deprecated except of examples and tests. The change is incompatible because support for old keys is not provided for config keys in examples.

The input parameters for all of the servlets will have the 5 html meta characters quoted. The characters are ‘&’, ‘<’, ‘>’, ‘"’ and the apostrophe. The goal is to ensure that our web ui servlets can’t be used for cross site scripting (XSS) attacks. In particular, it blocks the frequent (especially for errors) case where the servlet echos back the parameters to the user.

New contribution Block Forensics aids investigation of missing blocks.

Modifies job history parser and Web UI to display counters

RPC can use Avro serialization.

Add native and streaming support for Vertica as an input or output format taking advantage of parallel read and write properties of the DBMS.

Extended the framework’s refresh-queue mechanism to support refresh of scheduler specific queue properties and implemented this refresh operation for some of the capacity scheduler properties. With this feature, one can refresh some of the capacity-scheduler’s queue related properties - queue capacities, user-limits per queue, max map/reduce capacity and max-jobs per user to initialize while the system is running and without restarting JT. Even after this, some features like changing enable/disable priorities, adding/removing queues are not supported in capacity-scheduler.

Added XML-based JobTracker status JSP page for metrics reporting

Changed Map-Reduce context objects to be interfaces.

Fixed null pointer error when quoting HTML in the case JSP has no parameters.

Makes taskCleanup tasks to use 1 slot even for high memory jobs.

Updates to task timing information were fixed in some code paths that led to inconsistent metering of task run times. Also, checks were added to guard against inconsistencies and record information about state if they should happen to occur.

Introduced an option to allow tasktrackers to send an out of band heartbeat on task-completion to improve job latency. A new configuration option mapreduce.tasktracker.outofband.heartbeat is defined, which can be enabled to send this heartbeat.

File system configuration keys renamed as a step toward API standardization and backward compatibility.

WARNING: No release note provided for this change.

Created new Forrest documentation for Gridmix.

Modified the scheduling logic in capacity scheduler to return a map and a reduce task per heartbeat.

Rename properly considers the case where both source and destination are over quota; operation will fail with error indication.

This patch makes TT to set HADOOP_ROOT_LOGGER to INFO,TLA by default in the environment of taskjvm and its children.

Introduced abortJob() method in OutputCommitter which will be invoked when the job fails or is killed. By default it invokes OutputCommitter.cleanupJob(). Deprecated OutputCommitter.cleanupJob() and introduced OutputCommitter.commitJob() method which will be invoked for successful jobs. Also a _SUCCESS file is created in the output folder for successful jobs. A configuration parameter mapreduce.fileoutputcommitter.marksuccessfuljobs can be set to false to disable creation of _SUCCESS file, or to true to enable creation of the _SUCCESS file.

Replaced the existing max task limits variables “mapred.capacity-scheduler.queue.<queue-name>.max.map.slots” and “mapred.capacity-scheduler.queue.<queue-name>.max.reduce.slots” with “mapred.capacity-scheduler.queue.<queue-name>.maximum-capacity” .

max task limit variables were used to throttle the queue, i.e, these were the hard limit and not allowing queue to grow further. maximum-capacity variable defines a limit beyond which a queue cannot use the capacity of the cluster. This provides a means to limit how much excess capacity a queue can use.

maximum-capacity variable behavior is different from max task limit variables, as maximum-capacity is a percentage and it grows and shrinks in absolute terms based on total cluster capacity.Also same maximum-capacity percentage is applied to both map and reduce.

Add following additional job tracker metrics: Reserved{Map, Reduce}Slots Occupied{Map, Reduce}Slots Running{Map, Reduce}Tasks Killed{Map, Reduce}Tasks

FailedJobs KilledJobs PrepJobs RunningJobs

TotalTrackers BlacklistedTrackers DecommissionedTrackers

WARNING: No release note provided for this change.

Added occupied map/reduce slots and reserved map/reduce slots to the “Cluster Summary” table on jobtracker web ui.

Fixed the distributed cache’s localizeCache to lock only the uri it is localizing.

Modified log statement in task memory monitoring thread to include task attempt id.

Corrected error where listing path no longer in name space could stop ListPathsServlet until system restarted.

Fix Jobtracker running maps/reduces metrics.

Changed some log statements that were filling up jobtracker logs to debug level.

Add full path name of the file to the under replicated block information and summary of total number of files, blocks, live and dead datanodes to metasave output.

FSOutputDataStream implement Syncable interface to provide hflush and hsync APIs to the application users.

Add new file system interface AbstractFileSystem with implementation of some file systems that delegate to old FileSystem.

Trash feature notifies user of over-quota condition rather than silently deleting files/directories; deletion can be compelled with “rm -skiptrash”.

Update the number of trackers and blacklisted trackers metrics when trackers are decommissioned.

Add HDFS implementation of AbstractFileSystem.

Record runtime exceptions in server log to facilitate fault analysis.

Fixes an issue of NPE in ProcfsBasedProcessTree in a corner case.

add mapred.fairscheduler.pool property to define which pool a job belongs to.

This patch implements an optional layer over HDFS that implements offline erasure-coding. It can be used to reduce the total storage requirements of DFS.

Corrected an error when checking quota policy that resulted in a failure to read the edits log, stopping the primary/secondary name node.

Memory leak in function hdfsFreeFileInfo in libhdfs. This bug affects fuse-dfs severely.

WARNING: No release note provided for this change.

WARNING: No release note provided for this change.

Add the Apache license header to several files that are missing it.

Fix JobSubmitter to honor user given symlink in the path.

Fixed a bug in DistributedCache, to not decrement reference counts for unreferenced files in error conditions.

If the running job is retired, then Job url is redirected to the history page. To construct the history url, JobTracker maintains the mapping of job id to history file names. The entries from mapping is purged for jobs older than mapreduce.jobtracker.jobhistory.maxage configured value.

Correct PendingDeletionBlocks metric to properly decrement counts.

Fixes the NPE in ‘refreshNodes’, ExpiryTracker thread and heartbeat. NPE occurred in the following cases - a blacklisted tracker is either decommissioned or expires. - a lost tracker gets blacklisted

WARNING: No release note provided for this change.

Fixes null handling in records returned from VerticaInputFormat

Added two expert level configuration properties. 1. “mapreduce.reduce.shuffle.notify.readerror” to know whether to send notification to JobTracker after every read error or not. If the configuration is false, read errors are treated similar to connection errors. 2. “mapreduce.reduce.shuffle.maxfetchfailures” to specify the maximum number of the fetch failures after which the failure will be notified to JobTracker.

Fixed errors that caused occasional failure of TestGridmixSubmission, and additionally refactored some Gridmix code.

HADOOP-6433. Add AsyncDiskService for asynchronous disk services.

For efficiency, TaskTrackers no longer unjar the job jar into the job cache directory. Users who previously depended on this functionality for shipping non-code dependencies can use the undocumented configuration parameter “mapreduce.job.jar.unpack.pattern” to cause specific jar contents to be unpacked.

Quotes the characters coming out of getRequestUrl and getServerName in HttpServer.java as per the specification in HADOOP-6151.

Directories specified in mapred.local.dir that can not be created now cause the TaskTracker to fail to start.

Per-pool map and reduce caps for Fair Scheduler.

Fixes a bug in linux task controller by making the paths array passed to fts_open() as null-terminated as per the man page.

Corrects the behaviour of tasks counters in case of failed tasks.Incorrect counter values can lead to bad scheduling decisions .This jira rectifies the problem by making sure decrement properly happens incase of failed tasks.

Add an api to get the visible length of a DFSDataInputStream.

Fixed DistributedCache to support sharing of the local cache files with other users on the same TaskTracker. The cache files are checked at the client side for public/private access on the file system, and that information is passed in the configuration. The TaskTrackers look at the configuration for each file during task localization, and, if the file was public on the filesystem, they are localized to a common space for sharing by all users’ tasks on the TaskTracker. Else the file is localized to the user’s private directory on the local filesystem.

New name node web UI page displays details of decommissioning progress. (dfsnodelist.jsp?whatNodes=DECOMMISSIONING)

WARNING: No release note provided for this change.

This patch allows TaskTracker reports it’s current available memory and CPU usage to JobTracker through heartbeat. The information can be used for scheduling and monitoring in the JobTracker. This patch changes the version of InterTrackerProtocal.

Fixed a bug in ReplicasMap.remove method, which compares the generation stamp of the replica removed to itself instead of the the block passed to the method to identify the replica to be removed.

Fix 3 findsbugs warnings.

Fix for a potential deadlock in the global blacklist of tasktrackers feature.

JobTracker holds stale references to TaskInProgress objects and hence indirectly holds reference to retired jobs resulting into memory leak. Only task-attempts which are yet to report their status are left behind in the memory. All the task-attempts are now removed from the JobTracker by iterating over the scheduled task-attempt ids instead of iterating over available task statuses.

new command line argument: tokensFile - path to the file with clients secret keys in JSON format

For jobs with only one reducer, the Partitioner will no longer be called. Applications depending on Partitioners modifying records for single reducer jobs will need to move this functionality elsewhere.

Adds support for Vertica 3.5 truncate table, deploy_design and numeric types.

WARNING: No release note provided for this change.

WARNING: No release note provided for this change.

WARNING: No release note provided for this change.

mapreduce.job.hdfs-servers - declares hdfs servers to be used by the job, so client can pre-fetch delegation tokens for thsese servers (comma separated list of NameNodes).

Added configuration “mapreduce.tasktracker.group”, a group name to which TaskTracker belongs. When LinuxTaskController is used, task-controller binary’s group owner should be this group. The same should be specified in task-controller.cfg also.

Improved initialization sequence so that Port Out of Range error when starting web server will less likely interrupt testing.

Added an api FileUtil.fullyDeleteContents(String dir) to delete contents of the directory.

Fixed Map/Reduce framework to not call commit task for special tasks like job setup/cleanup and task cleanup.

Fixed TaskLauncher to stop waiting for blocking slots, for a TIP that is killed / failed while it is in queue.

Add hidden configuration option “ipc.server.max.response.size” to change the default 1 MB, the maximum size when large IPC handler response buffer is reset.

Added job-level authorization to MapReduce. JobTracker will now use the cluster configuration “mapreduce.cluster.job-authorization-enabled” to enable the checks to verify the authority of access of jobs where ever needed. Introduced two job-configuration properties to specify ACLs: “mapreduce.job.acl-view-job” and “mapreduce.job.acl-modify-job”. For now, RPCs related to job-level counters, task-level counters and tasks’ diagnostic information are protected by “mapreduce.job.acl-view-job” ACL. “mapreduce.job.acl-modify-job” protects killing of a job, killing a task of a job, failing a task of a job and setting the priority of a job. Irrespective of the above two ACLs, job-owner, superuser and members of supergroup configured on JobTracker via mapred.permissions.supergroup, can do all the view and modification operations.

mapreduce.job.complete.cancel.delegation.tokens - if false - don’t cancel delegation token renewal when the job is complete, because it may be used by some other job.

HDFS-913. Rename fault injection test TestRename.java to TestFiRename.java to include it in tests run by ant target run-test-hdfs-fault-inject.

WARNING: No release note provided for this change.

HDFS-245. Adds a symlink implementation to HDFS. This complements the new symlink feature added in HADOOP-6421

Added web-authorization for the default servlets - /logs, /stacks, /logLevel, /metrics, /conf, so that only cluster administrators can access these servlets. hadoop.cluster.administrators is the new configuration in core-default.xml that can be used to specify the ACL against which an authenticated user should be verified if he/she is an administrator.

Layout version is set to -24 reflecting changes in edits log and fsimage format related to persisting delegation tokens.

Adds job-level authorization to servlets(other than history related servlets) for accessing job related info. Deprecates mapreduce.jobtracker.permissions.supergroup and adds the config mapreduce.cluster.permissions.supergroup at cluster level sothat it will be used by TaskTracker also. Authorization checks are done if authentication is succeeded and mapreduce.cluster.job-authorization-enabled is set to true.

Detailed exceptions declared in FileContext and AbstractFileSystem

MAPREDUCE-1423. Improve performance of CombineFileInputFormat when multiple pools are configured. (Dhruba Borthakur via zshao)

Servlets should quote server generated strings sent in the response

Fixes bugs in linux task controller and TaskRunner.setupWorkDir() related to handling of symlinks.

The servlets should quote server generated strings sent in the response.

When cat a directory or a non-existent file from the command line, the error message gets printed becomes cat: io.java.FileNotFoundException: File does not exist: <absolute path name>

Added web-authorization for job-history pages. This is an incompatible change - it changes the JobHistory format by adding job-acls to job-history files and JobHistory currently does not have the support to read older versions of history files.

Introduced enableJobForCleanup() api in TaskController. This api enables deletion of stray files (with no write permissions for task-tracker) from job’s work dir. Note that the behavior is similar to TaskController#enableTaskForCleanup() except the path on which the ‘chmod’ is done is the job’s work dir.

[MUMAK] Randomize the arrival of heartbeat responses

Changes the format of the message with Heap usage on the NameNode web page.

Added private configuration variables: mapred.cache.files.filesizes and mapred.cache.archives.filesizes to store sizes of distributed cache artifacts per job. This can be used by tools like Gridmix in simulation runs.

Limit the size of diagnostics-string and state-string shipped as part of task status. This will help keep the JobTracker’s memory usage under control. Diagnostic string and state string are capped to 1024 chars.

WARNING: No release note provided for this change.

Fixed a bug that failed jobs that are run by the same user who started the mapreduce system(cluster).

Fixed a bug in the testcase TestTTResourceReporting.

WARNING: No release note provided for this change.

WARNING: No release note provided for this change.

WARNING: No release note provided for this change.

Added a private configuration variable mapreduce.input.num.files, to store number of input files being processed by M/R job.

Support for fully qualified HDFS path in addition to simple unqualified path. The qualified path indicates that the path is accessible on the specific HDFS. Non qualified path is qualified in all clusters.

Fixed a bug related to resource estimation for disk-based scheduling by modifying TaskTracker to return correct map output size for the completed maps and -1 for other tasks or failures.

Removed streaming testcase which tested non-existent functionality in Streaming.

Ensure that MRReliability works with retired-jobs feature turned on.

The exceptions thrown by the RPC client no longer carries a redundant exception class name in exception message.

This issue adds Iterator<FileStatus> listStatus(Path) to FileContext, moves FileStatus[] listStatus(Path) to FileContext#Util, and adds Iterator<FileStatus> listStatusItor(Path) to AbstractFileSystem which provides a default implementation by using FileStatus[] listStatus(Path).

Fixed a bug related to access of job_conf.xml from the history web page of a job.

Fixed a race condition involving JvmRunner.kill() and KillTaskAction, which was leading to an NullPointerException causing a transient inconsistent state in JvmManager and failure of tasks.

Fixed TaskTracker so that it does not set permissions on job-log directory recursively. This fix both improves the performance of job localization as well as avoids a bug related to launching of task-cleanup attempts after TaskTracker’s restart.

Fixed a bug in tasklog servlet which displayed wrong error message about job ACLs - an access control error instead of the expected log files gone error - after task logs directory is deleted.

Fixed a testcase problem in TestJobACLs.

Commands chmod, chown and chgrp now returns non zero exit code and an error message on failure instead of returning zero.

Updated forrest documentation to reflect the changes w.r.t public and private distributed cache files.

HADOOP-6515. Make maximum number of http threads configurable (Scott Chen via zshao)

MAPREDUCE-1568. TrackerDistributedCacheManager should clean up cache in a background thread. (Scott Chen via zshao)

Fixed a bug that caused all the AdminOperationsProtocol operations to fail when service-level authorization is enabled. The problem is solved by registering AdminOperationsProtocol also with MapReducePolicyProvider.

Removed the documentation for the ‘unstable’ job-acls feature from branch 0.21.

Updated forrest documentation to reflect the changes to make localized files from DistributedCache have right access-control on TaskTrackers(MAPREDUCE-856).

Fixed initialization of a task-cleanup attempt’s log directory by setting correct permissions via task-controller. Added new log4j properties hadoop.tasklog.iscleanup and log4j.appender.TLA.isCleanup to conf/log4j.properties. Changed the userlogs for a task-cleanup attempt to go into its own directory instead of the original attempt directory. This is an incompatible change as old userlogs of cleanup attempt-dirs before this release will no longer be visible.

Fixed TaskTracker to avoid hung and unusable slots when TaskRunner crashes with NPE and leaves tasks in UNINITIALIZED state for ever.

I’ve just committed this to 0.21.

Documented the behavior of -file option in streaming and deprecated it in favor of generic -files option.

Fixed TestJobACLs test timeout failure because of no slots for launching JOB_CLEANUP task.

Updated documentation related to (1) the changes in the configuration for memory management of tasks and (2) the modifications in scheduling model of CapacityTaskScheduler to include memory requirement by tasks.

Removed configuration property “hadoop.cluster.administrators”. Added constructor public HttpServer(String name, String bindAddress, int port, boolean findPort, Configuration conf, AccessControlList adminsAcl) in HttpServer, which takes cluster administrators acl as a parameter.