org.apache.hadoop.mapreduce.lib.input
Class FileInputFormat<K,V>

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
Direct Known Subclasses:
CombineFileInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat

public abstract class FileInputFormat<K,V>
extends InputFormat<K,V>

A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure that input files are not split up and are processed as a whole by Mappers.
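As a sketch of typical usage (a job-configuration fragment; the job name and paths are illustrative, and a Hadoop 0.20-era classpath is assumed), a driver selects a concrete subclass such as TextInputFormat and registers input paths through the static helpers on this class:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "example");
        // FileInputFormat itself is abstract; use a concrete subclass.
        job.setInputFormatClass(TextInputFormat.class);
        // Register an input directory (path is illustrative).
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        // ... set mapper, reducer, and output format, then submit.
    }
}
```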


Nested Class Summary
static class FileInputFormat.Counter
           
 
Constructor Summary
FileInputFormat()
           
 
Method Summary
static void addInputPath(Job job, Path path)
          Add a Path to the list of inputs for the map-reduce job.
static void addInputPaths(Job job, String commaSeparatedPaths)
          Add the given comma separated paths to the list of inputs for the map-reduce job.
protected  long computeSplitSize(long blockSize, long minSize, long maxSize)
           
protected  int getBlockIndex(BlockLocation[] blkLocations, long offset)
           
protected  long getFormatMinSplitSize()
          Get the lower bound on split size imposed by the format.
static PathFilter getInputPathFilter(JobContext context)
          Get a PathFilter instance of the filter set for the input paths.
static Path[] getInputPaths(JobContext context)
          Get the list of input Paths for the map-reduce job.
static long getMaxSplitSize(JobContext context)
          Get the maximum split size.
static long getMinSplitSize(JobContext job)
          Get the minimum split size.
 List<InputSplit> getSplits(JobContext job)
          Generate the list of files and make them into FileSplits.
protected  boolean isSplitable(JobContext context, Path filename)
          Is the given filename splitable? Usually true, but if the file is stream compressed, it will not be.
protected  List<FileStatus> listStatus(JobContext job)
          List input directories.
static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
          Set a PathFilter to be applied to the input paths for the map-reduce job.
static void setInputPaths(Job job, Path... inputPaths)
          Set the array of Paths as the list of inputs for the map-reduce job.
static void setInputPaths(Job job, String commaSeparatedPaths)
          Sets the given comma separated paths as the list of inputs for the map-reduce job.
static void setMaxInputSplitSize(Job job, long size)
          Set the maximum split size.
static void setMinInputSplitSize(Job job, long size)
          Set the minimum input split size.
 
Methods inherited from class org.apache.hadoop.mapreduce.InputFormat
createRecordReader
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FileInputFormat

public FileInputFormat()
Method Detail

getFormatMinSplitSize

protected long getFormatMinSplitSize()
Get the lower bound on split size imposed by the format.

Returns:
the number of bytes of the minimal split for this format

isSplitable

protected boolean isSplitable(JobContext context,
                              Path filename)
Is the given filename splitable? Usually true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.

Parameters:
context - the job context
filename - the file name to check
Returns:
is this file splitable?
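A subclass that must hand each file to a single Mapper can override this method and return false unconditionally; a minimal sketch (the class name is illustrative, and a Hadoop classpath is assumed):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treats every input file as unsplittable: one Mapper per file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size
    }
}
```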

setInputPathFilter

public static void setInputPathFilter(Job job,
                                      Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.

Parameters:
job - the job to modify
filter - the PathFilter class to use for filtering the input paths.
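A filter is supplied as a class (it is instantiated by reflection, so it needs a public no-arg constructor); a sketch of a filter that accepts only ".log" files (the class name and suffix are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accepts only files whose names end in ".log".
public class LogFileFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().endsWith(".log");
    }
}

// In the driver:
//   FileInputFormat.setInputPathFilter(job, LogFileFilter.class);
```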

setMinInputSplitSize

public static void setMinInputSplitSize(Job job,
                                        long size)
Set the minimum input split size.

Parameters:
job - the job to modify
size - the minimum size

getMinSplitSize

public static long getMinSplitSize(JobContext job)
Get the minimum split size.

Parameters:
job - the job
Returns:
the minimum number of bytes that can be in a split

setMaxInputSplitSize

public static void setMaxInputSplitSize(Job job,
                                        long size)
Set the maximum split size.

Parameters:
job - the job to modify
size - the maximum split size

getMaxSplitSize

public static long getMaxSplitSize(JobContext context)
Get the maximum split size.

Parameters:
context - the job to look at.
Returns:
the maximum number of bytes a split can include
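The minimum and maximum setters above bound the split size that getSplits(JobContext) computes for each file; a job-configuration sketch (the sizes are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitBounds {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split-bounds");
        // Keep every split between 32 MB and 256 MB.
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```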

getInputPathFilter

public static PathFilter getInputPathFilter(JobContext context)
Get a PathFilter instance of the filter set for the input paths.

Returns:
the PathFilter instance set for the job, or null if none has been set.

listStatus

protected List<FileStatus> listStatus(JobContext job)
                               throws IOException
List input directories. Subclasses may override to, e.g., select only files matching a regular expression.

Parameters:
job - the job to list input paths for
Returns:
list of FileStatus objects
Throws:
IOException - if no input files are found.
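A subclass can filter the default listing rather than replace it; a sketch that keeps only ".csv" inputs (the class name and pattern are illustrative, and a Hadoop classpath is assumed):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Narrows the default input listing to files matching a name pattern.
public class CsvOnlyInputFormat extends TextInputFormat {
    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> result = new ArrayList<FileStatus>();
        for (FileStatus stat : super.listStatus(job)) {
            if (stat.getPath().getName().endsWith(".csv")) {
                result.add(stat);
            }
        }
        return result;
    }
}
```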

getSplits

public List<InputSplit> getSplits(JobContext job)
                           throws IOException
Generate the list of files and make them into FileSplits.

Specified by:
getSplits in class InputFormat<K,V>
Parameters:
job - job configuration.
Returns:
a list of InputSplits for the job.
Throws:
IOException

computeSplitSize

protected long computeSplitSize(long blockSize,
                                long minSize,
                                long maxSize)
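The split size is the file's block size clamped to the configured [minSize, maxSize] range, i.e. max(minSize, min(maxSize, blockSize)). A standalone sketch of that rule (the class name and the sample sizes are illustrative; no Hadoop classpath is needed):

```java
public class SplitSizeDemo {
    // Mirrors FileInputFormat's split-size rule: clamp the block size
    // to the [minSize, maxSize] range configured on the job.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // a 64 MB block
        // With default bounds, the split size equals the block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE));
        // A minimum larger than the block size wins.
        System.out.println(computeSplitSize(block, 128L * 1024 * 1024, Long.MAX_VALUE));
        // A maximum smaller than the block size wins.
        System.out.println(computeSplitSize(block, 1L, 32L * 1024 * 1024));
    }
}
```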

getBlockIndex

protected int getBlockIndex(BlockLocation[] blkLocations,
                            long offset)

setInputPaths

public static void setInputPaths(Job job,
                                 String commaSeparatedPaths)
                          throws IOException
Sets the given comma separated paths as the list of inputs for the map-reduce job.

Parameters:
job - the job
commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
Throws:
IOException

addInputPaths

public static void addInputPaths(Job job,
                                 String commaSeparatedPaths)
                          throws IOException
Add the given comma separated paths to the list of inputs for the map-reduce job.

Parameters:
job - The job to modify
commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
Throws:
IOException

setInputPaths

public static void setInputPaths(Job job,
                                 Path... inputPaths)
                          throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.

Parameters:
job - The job to modify
inputPaths - the Paths of the input directories/files for the map-reduce job.
Throws:
IOException

addInputPath

public static void addInputPath(Job job,
                                Path path)
                         throws IOException
Add a Path to the list of inputs for the map-reduce job.

Parameters:
job - The Job to modify
path - Path to be added to the list of inputs for the map-reduce job.
Throws:
IOException
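The set* variants replace the current input list while the add* variants append to it; a driver sketch combining the two (the paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class Inputs {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "inputs");
        // Replace the input list with two directories...
        FileInputFormat.setInputPaths(job,
                new Path("/data/2008"), new Path("/data/2009"));
        // ...then append a third.
        FileInputFormat.addInputPath(job, new Path("/data/latest"));
    }
}
```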

getInputPaths

public static Path[] getInputPaths(JobContext context)
Get the list of input Paths for the map-reduce job.

Parameters:
context - The job
Returns:
the list of input Paths for the map-reduce job.


Copyright © 2009 The Apache Software Foundation