@InterfaceAudience.Public @InterfaceStability.Stable public abstract class FileOutputFormat<K,V> extends OutputFormat<K,V>
OutputFormat
s that read from FileSystem
s.Modifier and Type | Field and Description |
---|---|
protected static String |
BASE_OUTPUT_NAME |
static String |
COMPRESS |
static String |
COMPRESS_CODEC |
static String |
COMPRESS_TYPE |
static String |
OUTDIR |
protected static String |
PART |
Constructor and Description |
---|
FileOutputFormat() |
Modifier and Type | Method and Description |
---|---|
void |
checkOutputSpecs(JobContext job)
Check for validity of the output-specification for the job.
|
static boolean |
getCompressOutput(JobContext job)
Is the job output compressed?
|
Path |
getDefaultWorkFile(TaskAttemptContext context,
String extension)
Get the default path and filename for the output format.
|
OutputCommitter |
getOutputCommitter(TaskAttemptContext context)
Get the output committer for this output format.
|
static Class<? extends CompressionCodec> |
getOutputCompressorClass(JobContext job,
Class<? extends CompressionCodec> defaultValue)
Get the
CompressionCodec for compressing the job outputs. |
protected static String |
getOutputName(JobContext job)
Get the base output name for the output file.
|
static Path |
getOutputPath(JobContext job)
Get the
Path to the output directory for the map-reduce job. |
static Path |
getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context,
String name,
String extension)
Helper function to generate a
Path for a file that is unique for
the task within the job output directory. |
abstract RecordWriter<K,V> |
getRecordWriter(TaskAttemptContext job)
Get the
RecordWriter for the given task. |
static String |
getUniqueFile(TaskAttemptContext context,
String name,
String extension)
Generate a unique filename, based on the task id, name, and extension
|
static Path |
getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context)
Get the
Path to the task's temporary output directory
for the map-reduce job
Tasks' Side-Effect Files |
static void |
setCompressOutput(Job job,
boolean compress)
Set whether the output of the job is compressed.
|
static void |
setOutputCompressorClass(Job job,
Class<? extends CompressionCodec> codecClass)
Set the
CompressionCodec to be used to compress job outputs. |
protected static void |
setOutputName(JobContext job,
String name)
Set the base output name for output file to be created.
|
static void |
setOutputPath(Job job,
Path outputDir)
Set the
Path of the output directory for the map-reduce job. |
protected static final String BASE_OUTPUT_NAME
protected static final String PART
public static final String COMPRESS
public static final String COMPRESS_CODEC
public static final String COMPRESS_TYPE
public static final String OUTDIR
public static void setCompressOutput(Job job, boolean compress)
job
- the job to modifycompress
- should the output of the job be compressed?public static boolean getCompressOutput(JobContext job)
job
- the Job to look intrue
if the job output should be compressed,
false
otherwisepublic static void setOutputCompressorClass(Job job, Class<? extends CompressionCodec> codecClass)
CompressionCodec
to be used to compress job outputs.job
- the job to modifycodecClass
- the CompressionCodec
to be used to
compress the job outputspublic static Class<? extends CompressionCodec> getOutputCompressorClass(JobContext job, Class<? extends CompressionCodec> defaultValue)
CompressionCodec
for compressing the job outputs.job
- the Job
to look indefaultValue
- the CompressionCodec
to return if not setCompressionCodec
to be used to compress the
job outputsIllegalArgumentException
- if the class was specified, but not foundpublic abstract RecordWriter<K,V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
OutputFormat
RecordWriter
for the given task.getRecordWriter
in class OutputFormat<K,V>
job
- the information about the current task.RecordWriter
to write the output for the job.IOException
InterruptedException
public void checkOutputSpecs(JobContext job) throws FileAlreadyExistsException, IOException
OutputFormat
This is to validate the output specification for the job when it is a job is submitted. Typically checks that it does not already exist, throwing an exception when it already exists, so that output is not overwritten.
checkOutputSpecs
in class OutputFormat<K,V>
job
- information about the jobIOException
- when output should not be attemptedFileAlreadyExistsException
public static void setOutputPath(Job job, Path outputDir)
Path
of the output directory for the map-reduce job.job
- The job to modifyoutputDir
- the Path
of the output directory for
the map-reduce job.public static Path getOutputPath(JobContext job)
Path
to the output directory for the map-reduce job.Path
to the output directory for the map-reduce job.getWorkOutputPath(TaskInputOutputContext)
public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context) throws IOException, InterruptedException
Path
to the task's temporary output directory
for the map-reduce job
Tasks' Side-Effect Files
Some applications need to create/write-to side-files, which differ from the actual job-outputs.
In such cases there could be issues with 2 instances of the same TIP (running simultaneously e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per TIP.
To get around this the Map-Reduce framework helps the application-writer out by maintaining a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt the files in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
The application-writer can take advantage of this by creating any
side-files required in a work directory during execution
of his task i.e. via
getWorkOutputPath(TaskInputOutputContext)
, and
the framework will move them out similarly - thus she doesn't have to pick
unique paths per task-attempt.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.
Path
to the task's temporary output directory
for the map-reduce job.IOException
InterruptedException
public static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context, String name, String extension) throws IOException, InterruptedException
Path
for a file that is unique for
the task within the job output directory.
The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.
lsThis method uses the getUniqueFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String, java.lang.String)
method to make the file name
unique for the task.
context
- the context for the task.name
- the name for the file.extension
- the extension for the fileIOException
InterruptedException
public static String getUniqueFile(TaskAttemptContext context, String name, String extension)
context
- the task that is calling thisname
- the base filenameextension
- the filename extensionpublic Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException
context
- the task contextextension
- an extension to add to the filenameIOException
protected static String getOutputName(JobContext job)
protected static void setOutputName(JobContext job, String name)
public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException
OutputFormat
getOutputCommitter
in class OutputFormat<K,V>
context
- the task contextIOException
Copyright © 2022 Apache Software Foundation. All rights reserved.