org.apache.hadoop.mapreduce.lib.output
Class FileOutputFormat<K,V>

java.lang.Object
  extended by org.apache.hadoop.mapreduce.OutputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<K,V>
Direct Known Subclasses:
MapFileOutputFormat, SequenceFileOutputFormat, TextOutputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileOutputFormat<K,V>
extends OutputFormat<K,V>

A base class for OutputFormats that write to FileSystems.


Field Summary
protected static String BASE_OUTPUT_NAME
           
static String COMPRESS
           
static String COMPRESS_CODEC
           
static String COMPRESS_TYPE
           
static String OUTDIR
           
protected static String PART
           
 
Constructor Summary
FileOutputFormat()
           
 
Method Summary
 void checkOutputSpecs(JobContext job)
          Check for validity of the output-specification for the job.
static boolean getCompressOutput(JobContext job)
          Is the job output compressed?
 Path getDefaultWorkFile(TaskAttemptContext context, String extension)
          Get the default path and filename for the output format.
 OutputCommitter getOutputCommitter(TaskAttemptContext context)
          Get the output committer for this output format.
static Class<? extends CompressionCodec> getOutputCompressorClass(JobContext job, Class<? extends CompressionCodec> defaultValue)
          Get the CompressionCodec for compressing the job outputs.
protected static String getOutputName(JobContext job)
          Get the base output name for the output file.
static Path getOutputPath(JobContext job)
          Get the Path to the output directory for the map-reduce job.
static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context, String name, String extension)
          Helper function to generate a Path for a file that is unique for the task within the job output directory.
abstract  RecordWriter<K,V> getRecordWriter(TaskAttemptContext job)
          Get the RecordWriter for the given task.
static String getUniqueFile(TaskAttemptContext context, String name, String extension)
          Generate a unique filename, based on the task id, name, and extension
static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context)
          Get the Path to the task's temporary output directory for the map-reduce job.
static void setCompressOutput(Job job, boolean compress)
          Set whether the output of the job is compressed.
static void setOutputCompressorClass(Job job, Class<? extends CompressionCodec> codecClass)
          Set the CompressionCodec to be used to compress job outputs.
protected static void setOutputName(JobContext job, String name)
          Set the base output name for output file to be created.
static void setOutputPath(Job job, Path outputDir)
          Set the Path of the output directory for the map-reduce job.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BASE_OUTPUT_NAME

protected static final String BASE_OUTPUT_NAME
See Also:
Constant Field Values

PART

protected static final String PART
See Also:
Constant Field Values

COMPRESS

public static final String COMPRESS
See Also:
Constant Field Values

COMPRESS_CODEC

public static final String COMPRESS_CODEC
See Also:
Constant Field Values

COMPRESS_TYPE

public static final String COMPRESS_TYPE
See Also:
Constant Field Values

OUTDIR

public static final String OUTDIR
See Also:
Constant Field Values
Constructor Detail

FileOutputFormat

public FileOutputFormat()
Method Detail

setCompressOutput

public static void setCompressOutput(Job job,
                                     boolean compress)
Set whether the output of the job is compressed.

Parameters:
job - the job to modify
compress - should the output of the job be compressed?

getCompressOutput

public static boolean getCompressOutput(JobContext job)
Is the job output compressed?

Parameters:
job - the Job to look in
Returns:
true if the job output should be compressed, false otherwise

setOutputCompressorClass

public static void setOutputCompressorClass(Job job,
                                            Class<? extends CompressionCodec> codecClass)
Set the CompressionCodec to be used to compress job outputs.

Parameters:
job - the job to modify
codecClass - the CompressionCodec to be used to compress the job outputs

getOutputCompressorClass

public static Class<? extends CompressionCodec> getOutputCompressorClass(JobContext job,
                                                                         Class<? extends CompressionCodec> defaultValue)
Get the CompressionCodec for compressing the job outputs.

Parameters:
job - the Job to look in
defaultValue - the CompressionCodec to return if not set
Returns:
the CompressionCodec to be used to compress the job outputs
Throws:
IllegalArgumentException - if the class was specified, but not found
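For illustration, a minimal driver-side sketch of wiring these compression hooks together (the job name and the choice of GzipCodec are illustrative, not required by this class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compression-config-sketch");

    // Enable output compression and pick gzip as the codec.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // Read the settings back; GzipCodec here is only the fallback default.
    boolean compressed = FileOutputFormat.getCompressOutput(job);
    Class<? extends CompressionCodec> codec =
        FileOutputFormat.getOutputCompressorClass(job, GzipCodec.class);
    System.out.println("compress=" + compressed + ", codec=" + codec.getName());
  }
}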

getRecordWriter

public abstract RecordWriter<K,V> getRecordWriter(TaskAttemptContext job)
                                           throws IOException,
                                                  InterruptedException
Description copied from class: OutputFormat
Get the RecordWriter for the given task.

Specified by:
getRecordWriter in class OutputFormat<K,V>
Parameters:
job - the information about the current task.
Returns:
a RecordWriter to write the output for the job.
Throws:
IOException
InterruptedException

checkOutputSpecs

public void checkOutputSpecs(JobContext job)
                      throws FileAlreadyExistsException,
                             IOException
Description copied from class: OutputFormat
Check for validity of the output-specification for the job.

This is to validate the output specification for the job when it is submitted. Typically it checks that the output directory does not already exist, throwing an exception when it does, so that output is not overwritten.

Specified by:
checkOutputSpecs in class OutputFormat<K,V>
Parameters:
job - information about the job
Throws:
IOException - when output should not be attempted
FileAlreadyExistsException

setOutputPath

public static void setOutputPath(Job job,
                                 Path outputDir)
Set the Path of the output directory for the map-reduce job.

Parameters:
job - The job to modify
outputDir - the Path of the output directory for the map-reduce job.

getOutputPath

public static Path getOutputPath(JobContext job)
Get the Path to the output directory for the map-reduce job.

Returns:
the Path to the output directory for the map-reduce job.
See Also:
getWorkOutputPath(TaskInputOutputContext)
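
A driver-side sketch of setting and reading back the output directory (the path below is hypothetical); checkOutputSpecs(JobContext) will reject the job at submission time if this directory already exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputPathSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "output-path-sketch");

    // Hypothetical output directory; it must not exist yet, otherwise
    // checkOutputSpecs(JobContext) throws FileAlreadyExistsException at submission.
    FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

    Path configured = FileOutputFormat.getOutputPath(job);
    System.out.println("Output directory: " + configured);
  }
}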

getWorkOutputPath

public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context)
                              throws IOException,
                                     InterruptedException
Get the Path to the task's temporary output directory for the map-reduce job

Tasks' Side-Effect Files

Some applications need to create/write-to side-files, which differ from the actual job-outputs.

In such cases there could be issues with two instances of the same task (e.g. speculative attempts running simultaneously) trying to open or write to the same file (path) on HDFS. Hence the application writer will have to pick unique names per task-attempt (e.g. using the attempt id, say attempt_200709221812_0001_m_000000_0), not just per task.

To get around this the Map-Reduce framework helps the application-writer out by maintaining a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt the files in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application writer can take advantage of this by creating any required side-files in the work directory during task execution, i.e. via getWorkOutputPath(TaskInputOutputContext), and the framework will move them out similarly; thus there is no need to pick unique paths per task-attempt.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.

Returns:
the Path to the task's temporary output directory for the map-reduce job.
Throws:
IOException
InterruptedException
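
As a sketch of the pattern described above (the file name "side" and the record written are illustrative): a mapper writes a side-file under getWorkOutputPath(TaskInputOutputContext), and the committer promotes it to the job output directory only if this attempt succeeds.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Work directory for this task-attempt; its contents are promoted on commit.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    // getUniqueFile keys the name to the task id, so speculative attempts do not collide.
    Path sideFile = new Path(workDir, FileOutputFormat.getUniqueFile(context, "side", ".txt"));
    FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(sideFile, false)) {
      out.write("side-effect record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}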

getPathForWorkFile

public static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context,
                                      String name,
                                      String extension)
                               throws IOException,
                                      InterruptedException
Helper function to generate a Path for a file that is unique for the task within the job output directory.

The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.

This method uses the getUniqueFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String, java.lang.String) method to make the file name unique for the task.

Parameters:
context - the context for the task.
name - the name for the file.
extension - the extension for the file
Returns:
a unique path across all tasks of the job.
Throws:
IOException
InterruptedException
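
As a sketch, this is the one-call alternative to combining getWorkOutputPath(TaskInputOutputContext) and getUniqueFile(TaskAttemptContext, String, String); the name "summary" and the record written are illustrative:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SummaryReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // One call yields a per-task path whose parent is the job output work area.
    Path summary = FileOutputFormat.getPathForWorkFile(context, "summary", ".txt");
    FileSystem fs = summary.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(summary, false)) {
      out.write("summary record\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}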

getUniqueFile

public static String getUniqueFile(TaskAttemptContext context,
                                   String name,
                                   String extension)
Generate a unique filename, based on the task id, name, and extension

Parameters:
context - the task that is calling this
name - the base filename
extension - the filename extension
Returns:
a string like $name-[mrsct]-$id$extension

getDefaultWorkFile

public Path getDefaultWorkFile(TaskAttemptContext context,
                               String extension)
                        throws IOException
Get the default path and filename for the output format.

Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id
Throws:
IOException
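
A minimal sketch of how a concrete subclass typically uses this method from getRecordWriter(TaskAttemptContext); the format below, which writes one Text value per line and ignores its keys, is hypothetical:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineOutputFormat extends FileOutputFormat<NullWritable, Text> {
  @Override
  public RecordWriter<NullWritable, Text> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    // Attempt-local file, e.g. .../_temporary/.../part-r-00000.txt
    Path file = getDefaultWorkFile(job, ".txt");
    FileSystem fs = file.getFileSystem(job.getConfiguration());
    final FSDataOutputStream out = fs.create(file, false);
    return new RecordWriter<NullWritable, Text>() {
      @Override
      public void write(NullWritable key, Text value) throws IOException {
        out.write(value.toString().getBytes(StandardCharsets.UTF_8));
        out.write('\n');
      }
      @Override
      public void close(TaskAttemptContext context) throws IOException {
        out.close();
      }
    };
  }
}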

getOutputName

protected static String getOutputName(JobContext job)
Get the base output name for the output file.


setOutputName

protected static void setOutputName(JobContext job,
                                    String name)
Set the base output name for output file to be created.


getOutputCommitter

public OutputCommitter getOutputCommitter(TaskAttemptContext context)
                                   throws IOException
Description copied from class: OutputFormat
Get the output committer for this output format. This is responsible for ensuring the output is committed correctly.

Specified by:
getOutputCommitter in class OutputFormat<K,V>
Parameters:
context - the task context
Returns:
an output committer
Throws:
IOException
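
As a sketch of what the file-based committer exposes (the cast assumes the default FileOutputCommitter is in use, which a subclass could override):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CommitterInspection {
  // Returns the attempt-local work path that output files are written to before commit.
  static Path workPathFor(TaskAttemptContext context) throws IOException {
    OutputCommitter committer =
        new TextOutputFormat<NullWritable, Text>().getOutputCommitter(context);
    return ((FileOutputCommitter) committer).getWorkPath();
  }
}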


Copyright © 2014 Apache Software Foundation. All Rights Reserved.