org.apache.hadoop.mapreduce.lib.input
Class CombineFileInputFormat<K,V>

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
          extended by org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>
Direct Known Subclasses:
CombineFileInputFormat (org.apache.hadoop.mapred.lib), CombineSequenceFileInputFormat, CombineTextInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class CombineFileInputFormat<K,V>
extends FileInputFormat<K,V>

An abstract InputFormat whose InputFormat.getSplits(JobContext) method returns CombineFileSplits. Splits are constructed from the files under the input paths. A split cannot contain files from different pools, but each split may contain blocks from different files. If a maxSplitSize is specified, blocks on the same node are combined to form a single split, and blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, blocks from the same rack are combined in a single split and no attempt is made to create node-local splits. If maxSplitSize is equal to the block size, this class behaves like the default splitting in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to construct RecordReaders for CombineFileSplits.
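
The node-then-rack combining described above can be sketched in plain Java without any Hadoop dependency. This is a simplified model, not this class's actual implementation: CombineSketch, Block, pack and combine are hypothetical names, each block is assumed to live on exactly one node, maxSplitSize is assumed to be positive, and pools and the per-node and per-rack minimum sizes are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Simplified, Hadoop-free model of node-then-rack combining (hypothetical names). */
final class CombineSketch {

    /** One block with a single location; real blocks usually have several replicas. */
    static final class Block {
        final String name;
        final String node;
        final String rack;
        final long length;

        Block(String name, String node, String rack, long length) {
            this.name = name;
            this.node = node;
            this.rack = rack;
            this.length = length;
        }
    }

    /**
     * Groups blocks by the given key and closes a split each time a group has
     * accumulated at least maxSplitSize bytes. Blocks that do not fill a split
     * are returned as leftovers for the next pass.
     */
    private static List<Block> pack(List<Block> blocks, Function<Block, String> key,
                                    long maxSplitSize, List<List<String>> splits) {
        Map<String, List<Block>> groups = new LinkedHashMap<>();
        for (Block b : blocks) {
            groups.computeIfAbsent(key.apply(b), k -> new ArrayList<>()).add(b);
        }
        List<Block> leftovers = new ArrayList<>();
        for (List<Block> group : groups.values()) {
            List<Block> current = new ArrayList<>();
            long size = 0;
            for (Block b : group) {
                current.add(b);
                size += b.length;
                if (size >= maxSplitSize) {
                    splits.add(names(current));
                    current = new ArrayList<>();
                    size = 0;
                }
            }
            leftovers.addAll(current);
        }
        return leftovers;
    }

    private static List<String> names(List<Block> blocks) {
        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            out.add(b.name);
        }
        return out;
    }

    /** Node-local pass, then rack-local pass, then a pass that ignores location. */
    static List<List<String>> combine(List<Block> blocks, long maxSplitSize) {
        List<List<String>> splits = new ArrayList<>();
        List<Block> rest = pack(blocks, b -> b.node, maxSplitSize, splits);
        rest = pack(rest, b -> b.rack, maxSplitSize, splits);
        rest = pack(rest, b -> "", maxSplitSize, splits);
        if (!rest.isEmpty()) {
            splits.add(names(rest)); // model assumption: the remainder forms one last split
        }
        return splits;
    }
}
```

In this model, given three 64-byte blocks on one node, one block on a second node in the same rack, and one block in another rack, combine(blocks, 128) produces a node-local split of two blocks, a rack-local split of two blocks, and a final split holding the remaining block.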

See Also:
CombineFileSplit

Field Summary
static String SPLIT_MINSIZE_PERNODE
           
static String SPLIT_MINSIZE_PERRACK
           
 
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
 
Constructor Summary
CombineFileInputFormat()
          Default constructor.
 
Method Summary
protected  void createPool(List<PathFilter> filters)
          Create a new pool and add the filters to it.
protected  void createPool(PathFilter... filters)
          Create a new pool and add the filters to it.
abstract  RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context)
          Not implemented in this class; subclasses implement it to construct a RecordReader for a CombineFileSplit.
protected  BlockLocation[] getFileBlockLocations(FileSystem fs, FileStatus stat)
           
 List<InputSplit> getSplits(JobContext job)
          Generate the list of files and combine them into CombineFileSplits.
protected  boolean isSplitable(JobContext context, Path file)
          Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be.
protected  void setMaxSplitSize(long maxSplitSize)
          Specify the maximum size (in bytes) of each split.
protected  void setMinSplitSizeNode(long minSplitSizeNode)
          Specify the minimum size (in bytes) of each split per node.
protected  void setMinSplitSizeRack(long minSplitSizeRack)
          Specify the minimum size (in bytes) of each split per rack.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SPLIT_MINSIZE_PERNODE

public static final String SPLIT_MINSIZE_PERNODE
See Also:
Constant Field Values

SPLIT_MINSIZE_PERRACK

public static final String SPLIT_MINSIZE_PERRACK
See Also:
Constant Field Values
Constructor Detail

CombineFileInputFormat

public CombineFileInputFormat()
Default constructor.

Method Detail

setMaxSplitSize

protected void setMaxSplitSize(long maxSplitSize)
Specify the maximum size (in bytes) of each split. Each split is approximately equal to the specified size.


setMinSplitSizeNode

protected void setMinSplitSizeNode(long minSplitSizeNode)
Specify the minimum size (in bytes) of each split per node. This applies to data that is left over after the data on a single node has been combined into splits of the maximum size specified by maxSplitSize. This leftover data is combined into its own split if its size exceeds minSplitSizeNode.
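
The leftover rule can be illustrated with a small worked calculation. The sketch below is a model under stated assumptions rather than this class's code: LeftoverSketch, leftoverBytes and formsOwnSplit are hypothetical names, a split is assumed to close once it reaches maxSplitSize, the strict comparison follows the wording "exceeds" above, and an unset (zero) minSplitSizeNode is deliberately not modeled.

```java
/** Hypothetical model of the per-node leftover rule; not Hadoop's implementation. */
final class LeftoverSketch {

    /** Bytes remaining on a node after every split that reached maxSplitSize is closed. */
    static long leftoverBytes(long[] blockLengths, long maxSplitSize) {
        long current = 0;
        for (long length : blockLengths) {
            current += length;
            if (current >= maxSplitSize) {
                current = 0; // the split is full, so start accumulating a new one
            }
        }
        return current;
    }

    /**
     * Whether the leftover data would form its own node-local split.
     * Assumes minSplitSizeNode is positive; the unset case is outside this sketch.
     */
    static boolean formsOwnSplit(long leftover, long minSplitSizeNode) {
        if (minSplitSizeNode <= 0) {
            throw new IllegalArgumentException("minSplitSizeNode must be positive in this sketch");
        }
        return leftover > minSplitSizeNode;
    }
}
```

For example, five 64 MB blocks on one node with a maxSplitSize of 128 MB leave 64 MB over in this model. With a minSplitSizeNode of 32 MB the leftover forms its own split; with 100 MB it does not, and the data remains available for combining at the rack level.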


setMinSplitSizeRack

protected void setMinSplitSizeRack(long minSplitSizeRack)
Specify the minimum size (in bytes) of each split per rack. This applies to data that is left over after the data on a single rack has been combined into splits of the maximum size specified by maxSplitSize. This leftover data is combined into its own split if its size exceeds minSplitSizeRack.


createPool

protected void createPool(List<PathFilter> filters)
Create a new pool and add the filters to it. A split cannot have files from different pools.


createPool

protected void createPool(PathFilter... filters)
Create a new pool and add the filters to it. A pathname belongs to the pool if it satisfies any one of the specified filters. A split cannot have files from different pools.
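
The pool rule (a pathname matches a pool when any one of that pool's filters accepts it, and files from different pools never share a split) can be modeled as follows. This is an illustrative sketch only: PoolSketch, poolOf and mayShareSplit are hypothetical names, java.util.function.Predicate<String> stands in for PathFilter, pools are checked in creation order, and treating files that match no pool as one implicit group is an assumption of the sketch.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model of pool membership; Predicate<String> stands in for PathFilter. */
final class PoolSketch {

    /** Index of the first pool with a filter that accepts the path, or -1 if none does. */
    static int poolOf(String path, List<List<Predicate<String>>> pools) {
        for (int i = 0; i < pools.size(); i++) {
            for (Predicate<String> filter : pools.get(i)) {
                if (filter.test(path)) {
                    return i; // satisfying any one filter is enough
                }
            }
        }
        return -1;
    }

    /** Two files may be combined into one split only if they map to the same pool. */
    static boolean mayShareSplit(String a, String b, List<List<Predicate<String>>> pools) {
        return poolOf(a, pools) == poolOf(b, pools);
    }
}
```

With one pool holding filters for ".log" and ".out" files and a second pool holding a filter for ".csv" files, a.log and b.out may share a split in this model, while a.log and c.csv may not.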


isSplitable

protected boolean isSplitable(JobContext context,
                              Path file)
Description copied from class: FileInputFormat
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.

Overrides:
isSplitable in class FileInputFormat<K,V>
Parameters:
context - the job context
file - the file name to check
Returns:
true if this file is splittable

getSplits

public List<InputSplit> getSplits(JobContext job)
                           throws IOException
Description copied from class: FileInputFormat
Generate the list of files and make them into FileSplits. In this class, the splits returned are CombineFileSplits, as described in the class description.

Overrides:
getSplits in class FileInputFormat<K,V>
Parameters:
job - the job context
Returns:
a list of InputSplits for the job.
Throws:
IOException

createRecordReader

public abstract RecordReader<K,V> createRecordReader(InputSplit split,
                                                     TaskAttemptContext context)
                                              throws IOException
Not implemented in this class. Subclasses implement this method to construct a RecordReader for a CombineFileSplit.

Specified by:
createRecordReader in class InputFormat<K,V>
Parameters:
split - the split to be read
context - the information about the task
Returns:
a new record reader
Throws:
IOException

getFileBlockLocations

protected BlockLocation[] getFileBlockLocations(FileSystem fs,
                                                FileStatus stat)
                                         throws IOException
Throws:
IOException


Copyright © 2014 Apache Software Foundation. All Rights Reserved.