Class FileOutputFormat<K,V>

java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<K,V>
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<K,V>
Direct Known Subclasses:
MapFileOutputFormat, SequenceFileOutputFormat, TextOutputFormat

@Public @Stable public abstract class FileOutputFormat<K,V> extends OutputFormat<K,V>
A base class for OutputFormats that read from FileSystems.
  • Field Details

    • BASE_OUTPUT_NAME

      protected static final String BASE_OUTPUT_NAME
      See Also:
    • PART

      protected static final String PART
      See Also:
    • COMPRESS

      public static final String COMPRESS
      Configuration option: should output be compressed? "mapreduce.output.fileoutputformat.compress".
      See Also:
    • COMPRESS_CODEC

      public static final String COMPRESS_CODEC
      If compression is enabled, name of codec: "mapreduce.output.fileoutputformat.compress.codec".
      See Also:
    • COMPRESS_TYPE

      public static final String COMPRESS_TYPE
      Type of compression "mapreduce.output.fileoutputformat.compress.type": NONE, RECORD, BLOCK. Generally only used in SequenceFileOutputFormat.
      See Also:
    • OUTDIR

      public static final String OUTDIR
      Destination directory of work: "mapreduce.output.fileoutputformat.outputdir".
      See Also:
  • Constructor Details

    • FileOutputFormat

      public FileOutputFormat()
  • Method Details

    • setCompressOutput

      public static void setCompressOutput(Job job, boolean compress)
      Set whether the output of the job is compressed.
      Parameters:
      job - the job to modify
      compress - should the output of the job be compressed?
    • getCompressOutput

      public static boolean getCompressOutput(JobContext job)
      Is the job output compressed?
      Parameters:
      job - the Job to look in
      Returns:
      true if the job output should be compressed, false otherwise
    • setOutputCompressorClass

      public static void setOutputCompressorClass(Job job, Class<? extends CompressionCodec> codecClass)
      Set the CompressionCodec to be used to compress job outputs.
      Parameters:
      job - the job to modify
      codecClass - the CompressionCodec to be used to compress the job outputs
    • getOutputCompressorClass

      public static Class<? extends CompressionCodec> getOutputCompressorClass(JobContext job, Class<? extends CompressionCodec> defaultValue)
      Get the CompressionCodec for compressing the job outputs.
      Parameters:
      job - the Job to look in
      defaultValue - the CompressionCodec to return if not set
      Returns:
      the CompressionCodec to be used to compress the job outputs
      Throws:
      IllegalArgumentException - if the class was specified, but not found
    • getRecordWriter

      public abstract RecordWriter<K,V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
      Description copied from class: OutputFormat
      Get the RecordWriter for the given task.
      Specified by:
      getRecordWriter in class OutputFormat<K,V>
      Parameters:
      job - the information about the current task.
      Returns:
      a RecordWriter to write the output for the job.
      Throws:
      IOException
      InterruptedException
    • checkOutputSpecs

      public void checkOutputSpecs(JobContext job) throws FileAlreadyExistsException, IOException
      Description copied from class: OutputFormat
      Check for validity of the output-specification for the job.

      This is to validate the output specification for the job when it is a job is submitted. Typically checks that it does not already exist, throwing an exception when it already exists, so that output is not overwritten.

      Implementations which write to filesystems which support delegation tokens usually collect the tokens for the destination path(s) and attach them to the job context's JobConf.
      Specified by:
      checkOutputSpecs in class OutputFormat<K,V>
      Parameters:
      job - information about the job
      Throws:
      IOException - when output should not be attempted
      FileAlreadyExistsException
    • setOutputPath

      public static void setOutputPath(Job job, Path outputDir)
      Set the Path of the output directory for the map-reduce job.
      Parameters:
      job - The job to modify
      outputDir - the Path of the output directory for the map-reduce job.
    • getOutputPath

      public static Path getOutputPath(JobContext job)
      Get the Path to the output directory for the map-reduce job.
      Returns:
      the Path to the output directory for the map-reduce job.
      See Also:
    • getWorkOutputPath

      public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context) throws IOException, InterruptedException
      Get the Path to the task's temporary output directory for the map-reduce job Tasks' Side-Effect Files

      Some applications need to create/write-to side-files, which differ from the actual job-outputs.

      In such cases there could be issues with 2 instances of the same TIP (running simultaneously e.g. speculative tasks) trying to open/write-to the same file (path) on HDFS. Hence the application-writer will have to pick unique names per task-attempt (e.g. using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per TIP.

      To get around this the Map-Reduce framework helps the application-writer out by maintaining a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt the files in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

      The application-writer can take advantage of this by creating any side-files required in a work directory during execution of his task i.e. via getWorkOutputPath(TaskInputOutputContext), and the framework will move them out similarly - thus she doesn't have to pick unique paths per task-attempt.

      The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to HDFS.

      Returns:
      the Path to the task's temporary output directory for the map-reduce job.
      Throws:
      IOException
      InterruptedException
    • getPathForWorkFile

      public static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context, String name, String extension) throws IOException, InterruptedException
      Helper function to generate a Path for a file that is unique for the task within the job output directory.

      The path can be used to create custom files from within the map and reduce tasks. The path name will be unique for each task. The path parent will be the job output directory.

      ls

      This method uses the getUniqueFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String, java.lang.String) method to make the file name unique for the task.

      Parameters:
      context - the context for the task.
      name - the name for the file.
      extension - the extension for the file
      Returns:
      a unique path accross all tasks of the job.
      Throws:
      IOException
      InterruptedException
    • getUniqueFile

      public static String getUniqueFile(TaskAttemptContext context, String name, String extension)
      Generate a unique filename, based on the task id, name, and extension
      Parameters:
      context - the task that is calling this
      name - the base filename
      extension - the filename extension
      Returns:
      a string like $name-[mrsct]-$id$extension
    • getDefaultWorkFile

      public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException
      Get the default path and filename for the output format.
      Parameters:
      context - the task context
      extension - an extension to add to the filename
      Returns:
      a full path $output/_temporary/$taskid/part-[mr]-$id
      Throws:
      IOException
    • getOutputName

      protected static String getOutputName(JobContext job)
      Get the base output name for the output file.
    • setOutputName

      protected static void setOutputName(JobContext job, String name)
      Set the base output name for output file to be created.
    • getOutputCommitter

      public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException
      Description copied from class: OutputFormat
      Get the output committer for this output format. This is responsible for ensuring the output is committed correctly.
      Specified by:
      getOutputCommitter in class OutputFormat<K,V>
      Parameters:
      context - the task context
      Returns:
      an output committer
      Throws:
      IOException