org.apache.hadoop.mapreduce.lib.input
Class FileInputFormat<K,V>

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
Direct Known Subclasses:
CombineFileInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileInputFormat<K,V>
extends InputFormat<K,V>

A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure that input files are not split up and are processed whole by Mappers.


Field Summary
static String INPUT_DIR
           
static String INPUT_DIR_RECURSIVE
           
static String NUM_INPUT_FILES
           
static String PATHFILTER_CLASS
           
static String SPLIT_MAXSIZE
           
static String SPLIT_MINSIZE
           
 
Constructor Summary
FileInputFormat()
           
 
Method Summary
static void addInputPath(Job job, Path path)
          Add a Path to the list of inputs for the map-reduce job.
protected  void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter)
          Add files in the input path recursively into the results.
static void addInputPaths(Job job, String commaSeparatedPaths)
          Add the given comma separated paths to the list of inputs for the map-reduce job.
protected  long computeSplitSize(long blockSize, long minSize, long maxSize)
           
protected  int getBlockIndex(BlockLocation[] blkLocations, long offset)
           
protected  long getFormatMinSplitSize()
          Get the lower bound on split size imposed by the format.
static boolean getInputDirRecursive(JobContext job)
           
static PathFilter getInputPathFilter(JobContext context)
          Get a PathFilter instance of the filter set for the input paths.
static Path[] getInputPaths(JobContext context)
          Get the list of input Paths for the map-reduce job.
static long getMaxSplitSize(JobContext context)
          Get the maximum split size.
static long getMinSplitSize(JobContext job)
          Get the minimum split size.
 List<InputSplit> getSplits(JobContext job)
          Generate the list of files and make them into FileSplits.
protected  boolean isSplitable(JobContext context, Path filename)
          Is the given file splittable? Usually true, but not if the file is stream compressed.
protected  List<FileStatus> listStatus(JobContext job)
          List input directories.
protected  FileSplit makeSplit(Path file, long start, long length, String[] hosts)
          A factory that makes the split for this class.
static void setInputDirRecursive(Job job, boolean inputDirRecursive)
           
static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
          Set a PathFilter to be applied to the input paths for the map-reduce job.
static void setInputPaths(Job job, Path... inputPaths)
          Set the array of Paths as the list of inputs for the map-reduce job.
static void setInputPaths(Job job, String commaSeparatedPaths)
          Sets the given comma separated paths as the list of inputs for the map-reduce job.
static void setMaxInputSplitSize(Job job, long size)
          Set the maximum split size.
static void setMinInputSplitSize(Job job, long size)
          Set the minimum input split size.
 
Methods inherited from class org.apache.hadoop.mapreduce.InputFormat
createRecordReader
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INPUT_DIR

public static final String INPUT_DIR
See Also:
Constant Field Values

SPLIT_MAXSIZE

public static final String SPLIT_MAXSIZE
See Also:
Constant Field Values

SPLIT_MINSIZE

public static final String SPLIT_MINSIZE
See Also:
Constant Field Values

PATHFILTER_CLASS

public static final String PATHFILTER_CLASS
See Also:
Constant Field Values

NUM_INPUT_FILES

public static final String NUM_INPUT_FILES
See Also:
Constant Field Values

INPUT_DIR_RECURSIVE

public static final String INPUT_DIR_RECURSIVE
See Also:
Constant Field Values
Constructor Detail

FileInputFormat

public FileInputFormat()
Method Detail

setInputDirRecursive

public static void setInputDirRecursive(Job job,
                                        boolean inputDirRecursive)
Parameters:
job - the job to modify
inputDirRecursive - whether input directories should be read recursively

getInputDirRecursive

public static boolean getInputDirRecursive(JobContext job)
Parameters:
job - the job to look at.
Returns:
true if the input files should be read recursively

getFormatMinSplitSize

protected long getFormatMinSplitSize()
Get the lower bound on split size imposed by the format.

Returns:
the number of bytes of the minimal split for this format

isSplitable

protected boolean isSplitable(JobContext context,
                              Path filename)
Is the given file splittable? Usually true, but not if the file is stream compressed. FileInputFormat implementations can override this method and return false to ensure that individual input files are never split, so that Mappers process entire files.

Parameters:
context - the job context
filename - the file name to check
Returns:
true if this file is splittable

setInputPathFilter

public static void setInputPathFilter(Job job,
                                      Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.

Parameters:
job - the job to modify
filter - the PathFilter class to use for filtering the input paths.

setMinInputSplitSize

public static void setMinInputSplitSize(Job job,
                                        long size)
Set the minimum input split size.

Parameters:
job - the job to modify
size - the minimum size

getMinSplitSize

public static long getMinSplitSize(JobContext job)
Get the minimum split size.

Parameters:
job - the job
Returns:
the minimum number of bytes that can be in a split

setMaxInputSplitSize

public static void setMaxInputSplitSize(Job job,
                                        long size)
Set the maximum split size.

Parameters:
job - the job to modify
size - the maximum split size

getMaxSplitSize

public static long getMaxSplitSize(JobContext context)
Get the maximum split size.

Parameters:
context - the job to look at.
Returns:
the maximum number of bytes a split can include

getInputPathFilter

public static PathFilter getInputPathFilter(JobContext context)
Get a PathFilter instance of the filter set for the input paths.

Returns:
the PathFilter instance set for the job, or null if none has been set.

listStatus

protected List<FileStatus> listStatus(JobContext job)
                               throws IOException
List input directories. Subclasses may override to, e.g., select only files matching a regular expression.

Parameters:
job - the job to list input paths for
Returns:
a list of FileStatus objects
Throws:
IOException - if no input paths are specified for the job.
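
The remark about selecting only files that match a regular expression can be illustrated without a Hadoop dependency. The sketch below is a minimal stand-in, not Hadoop code: RegexNameFilterSketch and its methods are hypothetical names, and a Predicate<String> over file names plays the role that a PathFilter's accept(Path) method would play in a real job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Standalone sketch; RegexNameFilterSketch is a hypothetical name, not a Hadoop class. */
final class RegexNameFilterSketch {

    /** Build a filter that accepts only names matching the whole regular expression. */
    static Predicate<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return name -> pattern.matcher(name).matches();
    }

    /** Keep the names accepted by the filter, preserving their order. */
    static List<String> select(List<String> names, Predicate<String> filter) {
        return names.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example directory listing (made-up file names).
        List<String> listing = Arrays.asList("part-00000", "_SUCCESS", "part-00001", "notes.txt");
        System.out.println(select(listing, matching("part-\\d+")));
    }
}
```

In an actual job the same test would live in a PathFilter implementation registered with setInputPathFilter, or in an overridden listStatus.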

addInputPathRecursively

protected void addInputPathRecursively(List<FileStatus> result,
                                       FileSystem fs,
                                       Path path,
                                       PathFilter inputFilter)
                                throws IOException
Add files in the input path recursively into the results.

Parameters:
result - The List to store all files.
fs - The FileSystem.
path - The input path.
inputFilter - The input filter that can be used to filter files/dirs.
Throws:
IOException

makeSplit

protected FileSplit makeSplit(Path file,
                              long start,
                              long length,
                              String[] hosts)
A factory that makes the split for this class. It can be overridden by subclasses to create subtypes of FileSplit.


getSplits

public List<InputSplit> getSplits(JobContext job)
                           throws IOException
Generate the list of files and make them into FileSplits.

Specified by:
getSplits in class InputFormat<K,V>
Parameters:
job - the job context
Returns:
a list of InputSplits for the job.
Throws:
IOException
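
How a single splittable file is divided can be sketched as follows. This is a standalone approximation under stated assumptions, not the Hadoop implementation: FileSplitSketch and splitOffsets are hypothetical names, and the 1.1 slop factor (which lets the last split run up to 10% over the split size rather than emitting a tiny trailing split) is recalled from the Hadoop source and should be verified against the version in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch; FileSplitSketch is a hypothetical name, not a Hadoop class. */
final class FileSplitSketch {

    // Assumption: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Return {start, length} pairs that cover a file of the given length. */
    static List<long[]> splitOffsets(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        // Emit full-size splits while clearly more than one split's worth remains.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left becomes the last split; an empty file yields no ranges here.
        if (bytesRemaining != 0) {
            splits.add(new long[] {fileLength - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] split : splitOffsets(300L, 128L)) {
            System.out.println("start=" + split[0] + " length=" + split[1]);
        }
    }
}
```

For a file that isSplitable reports as not splittable, the whole file would instead form a single split.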

computeSplitSize

protected long computeSplitSize(long blockSize,
                                long minSize,
                                long maxSize)
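No description is given for this method. As a hedged sketch, assuming the split size is the block size clamped between the minimum and maximum split sizes (which is how the Hadoop source computes it, as far as can be recalled), the arithmetic looks like the following. SplitSizeSketch is a hypothetical standalone class, not part of Hadoop.

```java
/** Standalone sketch; SplitSizeSketch is a hypothetical name, not a Hadoop class. */
final class SplitSizeSketch {

    /**
     * Clamp the block size between minSize and maxSize.
     * Assumption: mirrors max(minSize, min(maxSize, blockSize)),
     * so minSize takes precedence if it exceeds maxSize.
     */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With no effective limits, the split size equals the block size.
        System.out.println(computeSplitSize(128 * mb, 1L, Long.MAX_VALUE));
        // A maximum below the block size yields smaller splits, hence more map tasks.
        System.out.println(computeSplitSize(128 * mb, 1L, 64 * mb));
        // A minimum above the block size yields larger splits, hence fewer map tasks.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE));
    }
}
```

Under this assumption, setMaxInputSplitSize is the knob for increasing the number of splits and setMinInputSplitSize the knob for reducing it.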

getBlockIndex

protected int getBlockIndex(BlockLocation[] blkLocations,
                            long offset)
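No description is given for this method. Judging from the signature, it returns the index of the block that contains the given byte offset. The sketch below illustrates that lookup under this assumption; BlockIndexSketch is a hypothetical standalone class, and two long arrays stand in for the values that BlockLocation.getOffset() and BlockLocation.getLength() would supply.

```java
/** Standalone sketch; BlockIndexSketch is a hypothetical name, not a Hadoop class. */
final class BlockIndexSketch {

    /**
     * Return the index of the block whose byte range [start, start + length)
     * contains the offset. blockStarts[i] and blockLengths[i] describe block i.
     */
    static int getBlockIndex(long[] blockStarts, long[] blockLengths, long offset) {
        for (int i = 0; i < blockStarts.length; i++) {
            long start = blockStarts[i];
            if (offset >= start && offset < start + blockLengths[i]) {
                return i;
            }
        }
        // Assumption: an offset outside every block is treated as a caller error.
        throw new IllegalArgumentException("Offset " + offset + " is outside of the file");
    }

    public static void main(String[] args) {
        long[] starts = {0L, 128L, 256L};
        long[] lengths = {128L, 128L, 44L};
        System.out.println(getBlockIndex(starts, lengths, 200L));
    }
}
```

A lookup of this kind lets each split be assigned the hosts of the block in which it starts, which is what the hosts argument of makeSplit carries.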

setInputPaths

public static void setInputPaths(Job job,
                                 String commaSeparatedPaths)
                          throws IOException
Sets the given comma separated paths as the list of inputs for the map-reduce job.

Parameters:
job - the job
commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
Throws:
IOException
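
Splitting a comma separated path string is less trivial than it looks when paths contain glob patterns such as {2012,2013}. The sketch below shows one way to split only on commas outside curly braces. It is an illustration under assumptions: PathListSketch and splitPaths are hypothetical names, and whether a given Hadoop version treats braces exactly this way should be checked against its source.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch; PathListSketch is a hypothetical name, not a Hadoop class. */
final class PathListSketch {

    /** Split on commas, ignoring commas nested inside '{' ... '}' glob groups. */
    static List<String> splitPaths(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        int depth = 0; // current nesting depth of curly braces
        int start = 0; // index where the current path begins
        for (int i = 0; i < commaSeparatedPaths.length(); i++) {
            char c = commaSeparatedPaths.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
            } else if (c == ',' && depth == 0) {
                paths.add(commaSeparatedPaths.substring(start, i));
                start = i + 1;
            }
        }
        // The text after the last top-level comma is the final path.
        paths.add(commaSeparatedPaths.substring(start));
        return paths;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        System.out.println(splitPaths("/logs/{2012,2013}/*,/extra"));
    }
}
```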

addInputPaths

public static void addInputPaths(Job job,
                                 String commaSeparatedPaths)
                          throws IOException
Add the given comma separated paths to the list of inputs for the map-reduce job.

Parameters:
job - The job to modify
commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
Throws:
IOException

setInputPaths

public static void setInputPaths(Job job,
                                 Path... inputPaths)
                          throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.

Parameters:
job - The job to modify
inputPaths - the Paths of the input directories/files for the map-reduce job.
Throws:
IOException

addInputPath

public static void addInputPath(Job job,
                                Path path)
                         throws IOException
Add a Path to the list of inputs for the map-reduce job.

Parameters:
job - The Job to modify
path - Path to be added to the list of inputs for the map-reduce job.
Throws:
IOException

getInputPaths

public static Path[] getInputPaths(JobContext context)
Get the list of input Paths for the map-reduce job.

Parameters:
context - The job
Returns:
the list of input Paths for the map-reduce job.


Copyright © 2013 Apache Software Foundation. All Rights Reserved.