@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileOutputFormat<K,V>
extends Object
implements OutputFormat<K,V>
| Constructor and Description |
|---|
| FileOutputFormat() |
| Modifier and Type | Method and Description |
|---|---|
| void | checkOutputSpecs(FileSystem ignored, JobConf job) — Check for validity of the output-specification for the job. |
| static boolean | getCompressOutput(JobConf conf) — Is the job output compressed? |
| static Class<? extends CompressionCodec> | getOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> defaultValue) — Get the CompressionCodec for compressing the job outputs. |
| static Path | getOutputPath(JobConf conf) — Get the Path to the output directory for the map-reduce job. |
| static Path | getPathForCustomFile(JobConf conf, String name) — Helper function to generate a Path for a file that is unique for the task within the job output directory. |
| abstract RecordWriter<K,V> | getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) — Get the RecordWriter for the given job. |
| static Path | getTaskOutputPath(JobConf conf, String name) — Helper function to create the task's temporary output directory and return the path to the task's output file. |
| static String | getUniqueName(JobConf conf, String name) — Helper function to generate a name that is unique for the task. |
| static Path | getWorkOutputPath(JobConf conf) — Get the Path to the task's temporary output directory for the map-reduce job (see Tasks' Side-Effect Files). |
| static void | setCompressOutput(JobConf conf, boolean compress) — Set whether the output of the job is compressed. |
| static void | setOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> codecClass) — Set the CompressionCodec to be used to compress job outputs. |
| static void | setOutputPath(JobConf conf, Path outputDir) — Set the Path of the output directory for the map-reduce job. |
public static void setCompressOutput(JobConf conf, boolean compress)

Set whether the output of the job is compressed.

Parameters:
conf - the JobConf to modify
compress - should the output of the job be compressed?

public static boolean getCompressOutput(JobConf conf)
Is the job output compressed?

Parameters:
conf - the JobConf to look in
Returns:
true if the job output should be compressed, false otherwise

public static void setOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> codecClass)
Set the CompressionCodec to be used to compress job outputs.

Parameters:
conf - the JobConf to modify
codecClass - the CompressionCodec to be used to compress the job outputs

public static Class<? extends CompressionCodec> getOutputCompressorClass(JobConf conf, Class<? extends CompressionCodec> defaultValue)
Get the CompressionCodec for compressing the job outputs.

Parameters:
conf - the JobConf to look in
defaultValue - the CompressionCodec to return if not set
Returns:
the CompressionCodec to be used to compress the job outputs
Throws:
IllegalArgumentException - if the class was specified, but not found

public abstract RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) throws IOException
Description copied from interface: OutputFormat

Get the RecordWriter for the given job.

Specified by:
getRecordWriter in interface OutputFormat<K,V>
Parameters:
job - configuration for the job whose output is being written.
name - the unique name for this part of the output.
progress - mechanism for reporting progress while writing to file.
Returns:
the RecordWriter to write the output for the job.
Throws:
IOException
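A concrete subclass supplies this method itself. The following is a minimal sketch of one way to do so, loosely modeled on how TextOutputFormat works; the class name TabSeparatedOutputFormat and the tab-separated format are illustrative, not part of the API:

```java
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Illustrative subclass: writes one "key<TAB>value" line per record.
public class TabSeparatedOutputFormat<K, V> extends FileOutputFormat<K, V> {

  @Override
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    // getTaskOutputPath creates the task's temporary output directory
    // and returns the path to this task's output file.
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    DataOutputStream out = fs.create(file, progress);

    return new RecordWriter<K, V>() {
      @Override
      public void write(K key, V value) throws IOException {
        out.writeBytes(key + "\t" + value + "\n");
      }

      @Override
      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}
```

Note that the FileSystem argument is ignored (as its name suggests); the filesystem is resolved from the output path and the job configuration.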
public void checkOutputSpecs(FileSystem ignored, JobConf job) throws FileAlreadyExistsException, InvalidJobConfException, IOException

Description copied from interface: OutputFormat

Check for validity of the output-specification for the job.

This is to validate the output specification for the job when it is submitted. Typically it checks that the output directory does not already exist, throwing an exception if it does, so that output is not overwritten.

Implementations which write to filesystems which support delegation tokens usually collect the tokens for the destination path(s) and attach them to the job configuration.

Specified by:
checkOutputSpecs in interface OutputFormat<K,V>
Parameters:
job - job configuration.
Throws:
IOException - when output should not be attempted
FileAlreadyExistsException
InvalidJobConfException
public static void setOutputPath(JobConf conf, Path outputDir)

Set the Path of the output directory for the map-reduce job.

Parameters:
conf - The configuration of the job.
outputDir - the Path of the output directory for the map-reduce job.

public static Path getOutputPath(JobConf conf)
Get the Path to the output directory for the map-reduce job.

Returns:
the Path to the output directory for the map-reduce job.
See Also:
getWorkOutputPath(JobConf)
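Putting the setters together, a job driver might configure the output path and output compression as follows (a minimal sketch; the output path and the choice of GzipCodec are illustrative assumptions):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class OutputConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Where the job's final output files will be promoted to.
    FileOutputFormat.setOutputPath(conf, new Path("/user/example/out"));

    // Compress the job output with gzip.
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}
```

After this, getCompressOutput(conf) returns true and getOutputPath(conf) returns the configured output directory.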
public static Path getWorkOutputPath(JobConf conf)

Get the Path to the task's temporary output directory for the map-reduce job.

Tasks' Side-Effect Files

Note: The following is valid only if the OutputCommitter is FileOutputCommitter. If the OutputCommitter is not a FileOutputCommitter, the task's temporary output directory is the same as getOutputPath(JobConf), i.e. ${mapreduce.output.fileoutputformat.outputdir}.

Some applications need to create/write-to side-files, which differ from the actual job-outputs.

In such cases there could be issues with two instances of the same TIP (running simultaneously, e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer has to pick unique names per task-attempt (e.g. using the attempt id, say attempt_200709221812_0001_m_000000_0), not just per TIP.

To get around this, the Map-Reduce framework helps the application-writer by maintaining a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt, the files in ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application-writer can take advantage of this by creating any side-files required in ${mapreduce.task.output.dir} during execution of the task, i.e. via getWorkOutputPath(JobConf), and the framework will move them out similarly - thus the application does not have to pick unique paths per task-attempt.

Note: the value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}, and this value is set by the map-reduce framework. So, just create any side-files in the path returned by getWorkOutputPath(JobConf) from a map/reduce task to take advantage of this feature.
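For example, a task that wants to emit a side-file next to its regular output might do something like the following sketch (the file name "index" and its contents are arbitrary illustrations):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFileSketch {
  // Intended to be called from within a map or reduce task.
  static void writeSideFile(JobConf job) throws IOException {
    // Resolves to ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid},
    // so concurrent attempts of the same TIP cannot collide; on success the
    // framework promotes the file to the job output directory.
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    Path sideFile = new Path(workDir, "index");

    FileSystem fs = sideFile.getFileSystem(job);
    try (FSDataOutputStream out = fs.create(sideFile)) {
      out.writeBytes("side-file contents\n");
    }
  }
}
```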
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.

Returns:
the Path to the task's temporary output directory for the map-reduce job.

public static Path getTaskOutputPath(JobConf conf, String name) throws IOException
Helper function to create the task's temporary output directory and return the path to the task's output file.

Parameters:
conf - job-configuration
name - temporary task-output filename
Throws:
IOException
public static String getUniqueName(JobConf conf, String name)

Helper function to generate a name that is unique for the task.

The generated name can be used to create custom files from within the different tasks for the job; the names for different tasks will not collide with each other.

The given name is postfixed with the task type ('m' for maps, 'r' for reduces) and the task partition number. For example, given the name 'test' running on the first map of the job, the generated name will be 'test-m-00000'.
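The documented naming scheme can be illustrated in plain Java. This is a hypothetical re-creation for illustration only; the real method derives the task type and partition from the JobConf rather than taking them as arguments:

```java
public class UniqueNameSketch {
    // Mirrors the documented format: "<name>-<taskType>-<5-digit partition>".
    static String uniqueName(String name, char taskType, int partition) {
        return String.format("%s-%c-%05d", name, taskType, partition);
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("test", 'm', 0)); // prints "test-m-00000"
    }
}
```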
Parameters:
conf - the configuration for the job.
name - the name to make unique.

public static Path getPathForCustomFile(JobConf conf, String name)
Helper function to generate a Path for a file that is unique for the task within the job output directory.

The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.

This method uses the getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String) method to make the file name unique for the task.

Parameters:
conf - the configuration for the job.
name - the name for the file.

Copyright © 2023 Apache Software Foundation. All rights reserved.