Class FileOutputFormat<K,V>
- All Implemented Interfaces:
OutputFormat<K,V>
- Direct Known Subclasses:
MapFileOutputFormat, MultipleOutputFormat, SequenceFileOutputFormat, TextOutputFormat
Method Summary
- void checkOutputSpecs(FileSystem ignored, JobConf job): Check for validity of the output-specification for the job.
- static boolean getCompressOutput(JobConf conf): Is the job output compressed?
- static Class<? extends CompressionCodec> getOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> defaultValue): Get the CompressionCodec for compressing the job outputs.
- static Path getOutputPath(JobConf conf): Get the Path to the output directory for the map-reduce job.
- static Path getPathForCustomFile(JobConf conf, String name): Helper function to generate a Path for a file that is unique for the task within the job output directory.
- abstract RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress): Get the RecordWriter for the given job.
- static Path getTaskOutputPath(JobConf conf, String name): Helper function to create the task's temporary output directory and return the path to the task's output file.
- static String getUniqueName(JobConf conf, String name): Helper function to generate a name that is unique for the task.
- static Path getWorkOutputPath(JobConf conf): Get the Path to the task's temporary output directory for the map-reduce job (Tasks' Side-Effect Files).
- static void setCompressOutput(JobConf conf, boolean compress): Set whether the output of the job is compressed.
- static void setOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> codecClass): Set the CompressionCodec to be used to compress job outputs.
- static void setOutputPath(JobConf conf, Path outputDir): Set the Path of the output directory for the map-reduce job.
- static void setWorkOutputPath(JobConf conf, Path outputDir): Set the Path of the task's temporary output directory for the map-reduce job.
-
Constructor Details
-
FileOutputFormat
public FileOutputFormat()
-
-
Method Details
-
setCompressOutput
public static void setCompressOutput(JobConf conf, boolean compress)
Set whether the output of the job is compressed.
- Parameters:
conf - the JobConf to modify
compress - should the output of the job be compressed?
-
getCompressOutput
public static boolean getCompressOutput(JobConf conf)
Is the job output compressed?
- Parameters:
conf - the JobConf to look in
- Returns:
true if the job output should be compressed, false otherwise
-
setOutputCompressorClass
public static void setOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> codecClass)
Set the CompressionCodec to be used to compress job outputs.
- Parameters:
conf - the JobConf to modify
codecClass - the CompressionCodec to be used to compress the job outputs
-
getOutputCompressorClass
public static Class<? extends CompressionCodec> getOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> defaultValue)
Get the CompressionCodec for compressing the job outputs.
- Parameters:
conf - the JobConf to look in
defaultValue - the CompressionCodec to return if not set
- Returns:
the CompressionCodec to be used to compress the job outputs
- Throws:
IllegalArgumentException - if the class was specified, but not found
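As a rough, Hadoop-free illustration of what the compression setter/getter pair does (this is a sketch, not the real implementation: a plain Map stands in for JobConf, and the configuration key is assumed to be `mapreduce.output.fileoutputformat.compress`):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of setCompressOutput/getCompressOutput: round-trip a boolean
// through a configuration key, defaulting to false when unset.
class CompressConfSketch {
    // Assumed key name; stands in for the real Hadoop configuration property.
    static final String KEY = "mapreduce.output.fileoutputformat.compress";

    static void setCompressOutput(Map<String, String> conf, boolean compress) {
        conf.put(KEY, Boolean.toString(compress));
    }

    static boolean getCompressOutput(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getCompressOutput(conf)); // false (default)
        setCompressOutput(conf, true);
        System.out.println(getCompressOutput(conf)); // true
    }
}
```

The real methods behave analogously but read and write a JobConf, and getOutputCompressorClass additionally resolves a class name to a CompressionCodec class, throwing IllegalArgumentException if the named class cannot be found.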
-
getRecordWriter
public abstract RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException
Description copied from interface: OutputFormat
Get the RecordWriter for the given job.
- Specified by:
getRecordWriter in interface OutputFormat<K,V>
- Parameters:
job - configuration for the job whose output is being written.
name - the unique name for this part of the output.
progress - mechanism for reporting progress while writing to file.
- Returns:
a RecordWriter to write the output for the job.
- Throws:
IOException
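A concrete FileOutputFormat subclass implements this method to return the writer that serializes key/value pairs into the task's output file. As a simplified, Hadoop-free sketch of what such a writer does (illustrative names only; the real RecordWriter interface also involves a Reporter on close, and TextOutputFormat's actual writer is more configurable):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Sketch of a text-style record writer: each write() emits
// "key<TAB>value<NEWLINE>", roughly what TextOutputFormat's writer produces.
class LineRecordWriterSketch<K, V> {
    private final Writer out;

    LineRecordWriterSketch(Writer out) {
        this.out = out;
    }

    void write(K key, V value) throws IOException {
        out.write(key.toString());
        out.write('\t');
        out.write(value.toString());
        out.write('\n');
    }

    void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        StringWriter buf = new StringWriter();
        LineRecordWriterSketch<String, Integer> w = new LineRecordWriterSketch<>(buf);
        w.write("apple", 3);
        w.write("pear", 5);
        w.close();
        System.out.print(buf); // apple\t3\npear\t5\n
    }
}
```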
-
checkOutputSpecs
public void checkOutputSpecs(FileSystem ignored, JobConf job) throws FileAlreadyExistsException, InvalidJobConfException, IOException
Description copied from interface: OutputFormat
Check for validity of the output-specification for the job.
This validates the output specification for the job when the job is submitted. Typically it checks that the output does not already exist, throwing an exception when it does, so that output is not overwritten.
Implementations which write to filesystems which support delegation tokens usually collect the tokens for the destination path(s) and attach them to the job configuration.
- Specified by:
checkOutputSpecs in interface OutputFormat<K,V>
- Parameters:
job - job configuration.
- Throws:
IOException - when output should not be attempted
FileAlreadyExistsException
InvalidJobConfException
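The contract above can be sketched without Hadoop using local paths (an assumption-laden simplification: the real method checks the configured output Path on the job's FileSystem and throws the more specific FileAlreadyExistsException/InvalidJobConfException):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the checkOutputSpecs contract: fail fast at submit time if no
// output directory is configured or if it already exists, so an earlier
// job's output is never overwritten.
class OutputSpecSketch {
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set.");
        }
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        // Passes: directory does not exist yet.
        checkOutputSpecs(Path.of("job-output-not-created-yet"));

        // Fails: directory already exists.
        Path existing = Files.createTempDirectory("job-output");
        try {
            checkOutputSpecs(existing);
            System.out.println("unexpected: no exception");
        } catch (IOException expected) {
            System.out.println("rejected existing dir");
        } finally {
            Files.delete(existing);
        }
    }
}
```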
-
setOutputPath
public static void setOutputPath(JobConf conf, Path outputDir)
Set the Path of the output directory for the map-reduce job.
- Parameters:
conf - The configuration of the job.
outputDir - the Path of the output directory for the map-reduce job.
-
setWorkOutputPath
public static void setWorkOutputPath(JobConf conf, Path outputDir)
Set the Path of the task's temporary output directory for the map-reduce job.
Note: Task output path is set by the framework.
- Parameters:
conf - The configuration of the job.
outputDir - the Path of the output directory for the map-reduce job.
-
getOutputPath
public static Path getOutputPath(JobConf conf)
Get the Path to the output directory for the map-reduce job.
- Returns:
the Path to the output directory for the map-reduce job.
-
getWorkOutputPath
public static Path getWorkOutputPath(JobConf conf)
Get the Path to the task's temporary output directory for the map-reduce job.

Tasks' Side-Effect Files

Note: The following is valid only if the OutputCommitter is FileOutputCommitter. If the OutputCommitter is not a FileOutputCommitter, the task's temporary output directory is the same as getOutputPath(JobConf), i.e. ${mapreduce.output.fileoutputformat.outputdir}.

Some applications need to create/write-to side-files, which differ from the actual job-outputs. In such cases there could be issues with two instances of the same TIP (running simultaneously, e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the attempt id, say attempt_200709221812_0001_m_000000_0), not just per TIP.

To get around this, the Map-Reduce framework helps the application-writer out by maintaining a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt, the files in ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application-writer can take advantage of this by creating any side-files required in ${mapreduce.task.output.dir} during execution of the reduce task, i.e. via getWorkOutputPath(JobConf), and the framework will move them out similarly; thus the application does not have to pick unique paths per task-attempt.

Note: the value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}, and this value is set by the map-reduce framework. So, just create any side-files in the path returned by getWorkOutputPath(JobConf) from a map/reduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.

- Returns:
the Path to the task's temporary output directory for the map-reduce job.
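The _temporary promotion described above can be sketched with local files (a Hadoop-free simplification; in the real framework the promotion is performed by FileOutputCommitter on HDFS, and the attempt id shown is the example from the text):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: a task-attempt writes into <outputdir>/_temporary/_<taskid>; on
// success its files are moved up into <outputdir>. A failed attempt's
// sub-directory would simply be discarded instead.
class SideFilePromotionSketch {
    static void promote(Path workDir, Path outputDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(workDir)) {
            for (Path f : files) {
                Files.move(f, outputDir.resolve(f.getFileName()));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("outputdir");
        String taskId = "attempt_200709221812_0001_m_000000_0";
        Path workDir = outputDir.resolve("_temporary").resolve("_" + taskId);
        Files.createDirectories(workDir);

        // The task-attempt writes a side-file into its private work directory.
        Files.writeString(workDir.resolve("side-file.txt"), "partial stats\n");

        // Successful completion: promote the attempt's files.
        promote(workDir, outputDir);
        System.out.println(Files.exists(outputDir.resolve("side-file.txt"))); // true
    }
}
```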
-
getTaskOutputPath
public static Path getTaskOutputPath(JobConf conf, String name) throws IOException
Helper function to create the task's temporary output directory and return the path to the task's output file.
- Parameters:
conf - job-configuration
name - temporary task-output filename
- Returns:
path to the task's temporary output file
- Throws:
IOException
-
getUniqueName
public static String getUniqueName(JobConf conf, String name)
Helper function to generate a name that is unique for the task.
The generated name can be used to create custom files from within the different tasks for the job; the names for different tasks will not collide with each other.
The given name is postfixed with the task type ('m' for maps, 'r' for reduces) and the task partition number. For example, given the name 'test' running on the first map of the job, the generated name will be 'test-m-00000'.
- Parameters:
conf - the configuration for the job.
name - the name to make unique.
- Returns:
a unique name across all tasks of the job.
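The naming scheme just described (name, task-type letter, zero-padded partition number) can be reproduced in plain Java; this sketch is not the Hadoop implementation, which reads the task type and partition from the JobConf:

```java
// Sketch of the getUniqueName naming scheme:
// <name>-<task type>-<zero-padded 5-digit partition number>.
class UniqueNameSketch {
    static String uniqueName(String name, char taskType, int partition) {
        return String.format("%s-%c-%05d", name, taskType, partition);
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("test", 'm', 0));  // test-m-00000
        System.out.println(uniqueName("test", 'r', 12)); // test-r-00012
    }
}
```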
-
getPathForCustomFile
public static Path getPathForCustomFile(JobConf conf, String name)
Helper function to generate a Path for a file that is unique for the task within the job output directory.
The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.
This method uses the getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String) method to make the file name unique for the task.
- Parameters:
conf - the configuration for the job.
name - the name for the file.
- Returns:
a unique path across all tasks of the job.
-