Package org.apache.hadoop.mapred.lib
Class CombineFileInputFormat<K,V>
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>
org.apache.hadoop.mapred.lib.CombineFileInputFormat<K,V>
- All Implemented Interfaces:
InputFormat<K,V>
- Direct Known Subclasses:
CombineSequenceFileInputFormat, CombineTextInputFormat
@Public
@Stable
public abstract class CombineFileInputFormat<K,V>
extends org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>
implements InputFormat<K,V>
An abstract InputFormat that returns CombineFileSplits
from the InputFormat.getSplits(JobConf, int) method.
Splits are constructed from the files under the input paths.
A split cannot have files from different pools.
Each split returned may contain blocks from different files.
If a maxSplitSize is specified, then blocks on the same node are
combined to form a single split. Blocks that are left over are
then combined with other blocks in the same rack.
If maxSplitSize is not specified, then blocks from the same rack
are combined in a single split; no attempt is made to create
node-local splits.
If the maxSplitSize is equal to the block size, then this class
is similar to the default splitting behaviour in Hadoop: each
block is a locally processed split.
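The combining rules above can be sketched with a small, self-contained model. This is not Hadoop's implementation: the `CombineSketch` and `Block` names are hypothetical, each block is assumed to have a single node and rack (real blocks are usually replicated), a `maxSplitSize` of zero or less stands for "not specified", and blocks that remain after the rack pass are assumed to form one final split per rack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of the combining rules; not Hadoop's code. */
final class CombineSketch {

    /** A block with one location; real HDFS blocks are usually replicated. */
    static final class Block {
        final String file;
        final long length;
        final String node;
        final String rack;

        Block(String file, long length, String node, String rack) {
            this.file = file;
            this.length = length;
            this.node = node;
            this.rack = rack;
        }
    }

    /** Groups blocks by node or by rack, keeping first-seen order. */
    private static Map<String, List<Block>> group(List<Block> blocks, boolean byNode) {
        Map<String, List<Block>> groups = new LinkedHashMap<>();
        for (Block b : blocks) {
            groups.computeIfAbsent(byNode ? b.node : b.rack, k -> new ArrayList<>()).add(b);
        }
        return groups;
    }

    /** Closes a split whenever a group reaches maxSplitSize; returns what is left over. */
    private static List<Block> pack(Map<String, List<Block>> groups, long maxSplitSize,
                                    List<List<Block>> splits) {
        List<Block> leftover = new ArrayList<>();
        for (List<Block> members : groups.values()) {
            List<Block> current = new ArrayList<>();
            long size = 0;
            for (Block b : members) {
                current.add(b);
                size += b.length;
                if (size >= maxSplitSize) {
                    splits.add(current);
                    current = new ArrayList<>();
                    size = 0;
                }
            }
            leftover.addAll(current);
        }
        return leftover;
    }

    /** Builds splits; maxSplitSize <= 0 means "not specified" in this sketch. */
    static List<List<Block>> combine(List<Block> blocks, long maxSplitSize) {
        List<List<Block>> splits = new ArrayList<>();
        if (maxSplitSize <= 0) {
            // No maximum: one split per rack, no node-local pass.
            splits.addAll(group(blocks, false).values());
            return splits;
        }
        // Pass 1: combine blocks on the same node.
        List<Block> afterNodes = pack(group(blocks, true), maxSplitSize, splits);
        // Pass 2: combine node leftovers with other blocks in the same rack.
        List<Block> afterRacks = pack(group(afterNodes, false), maxSplitSize, splits);
        // Assumption of this sketch: remaining blocks form one final split per rack.
        splits.addAll(group(afterRacks, false).values());
        return splits;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block("a", 64, "n1", "r1"), new Block("b", 64, "n1", "r1"),
            new Block("c", 64, "n2", "r1"), new Block("d", 64, "n3", "r1"),
            new Block("e", 64, "n4", "r2"));
        System.out.println(combine(blocks, 128).size() + " splits with maxSplitSize=128");
        System.out.println(combine(blocks, 64).size() + " splits with maxSplitSize=64");
        System.out.println(combine(blocks, 0).size() + " splits with no maxSplitSize");
    }
}
```

With a `maxSplitSize` equal to the block length, every block closes its own split in the node pass, which mirrors the "each block is a locally processed split" case.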
Subclasses implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter)
to construct RecordReaders for CombineFileSplits.
- See Also:
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
FileInputFormat.Counter -
Field Summary
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat
SPLIT_MINSIZE_PERNODE, SPLIT_MINSIZE_PERRACK
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE -
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
protected void | createPool(JobConf conf, List<PathFilter> filters) | Deprecated.
protected void | createPool(JobConf conf, PathFilter... filters) | Deprecated.
RecordReader<K,V> | createRecordReader(InputSplit split, TaskAttemptContext context) | This is not implemented yet.
abstract RecordReader<K,V> | getRecordReader(InputSplit split, JobConf job, Reporter reporter) | This is not implemented yet.
InputSplit[] | getSplits(JobConf job, int numSplits) | Logically split the set of input files for the job.
protected boolean | isSplitable(FileSystem fs, Path file) |
protected boolean | isSplitable(JobContext context, Path file) | Subclasses should avoid overriding this method and should instead only override isSplitable(FileSystem, Path).
protected FileStatus[] | listStatus(JobConf job) | List input directories.
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat
createPool, createPool, getFileBlockLocations, getSplits, setMaxSplitSize, setMinSplitSizeNode, setMinSplitSizeRack
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus
-
Constructor Details
-
CombineFileInputFormat
public CombineFileInputFormat()
Default constructor.
-
-
Method Details
-
getSplits
Description copied from interface: InputFormat
Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple.
- Specified by:
getSplits in interface InputFormat<K,V>
- Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
- Returns:
an array of InputSplits for the job.
- Throws:
IOException
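The note about logical splits can be illustrated with a minimal sketch that describes a file as a list of tuples without touching its contents. The `LogicalSplitSketch` and `Range` names are hypothetical, and the third tuple element is treated here as a length in bytes, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of logical splits; no file is read or rewritten. */
final class LogicalSplitSketch {

    /** A (path, start, length) description of one region of a file. */
    static final class Range {
        final String path;
        final long start;
        final long length;

        Range(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    /** Describes a file of fileLength bytes as consecutive ranges of at most splitSize bytes. */
    static List<Range> split(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            ranges.add(new Range(path, start, Math.min(splitSize, fileLength - start)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : split("/in/a", 300L, 128L)) {
            System.out.println(r);
        }
    }
}
```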
-
createPool
Deprecated.
Create a new pool and add the filters to it. A split cannot have files from different pools.
createPool
Deprecated.
Create a new pool and add the filters to it. A pathname can satisfy any one of the specified filters. A split cannot have files from different pools.
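The pool rules for createPool can be sketched as follows, using plain `Predicate<String>` objects in place of PathFilter. The `PoolSketch` class is hypothetical; choosing the first matching pool and collecting unmatched paths under a separate key are assumptions of this sketch, not documented behaviour. Because splits never mix pools, each resulting group would be split independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Hypothetical model of pools: a pool is a list of filters, any of which may accept a path. */
final class PoolSketch {

    /** Key used for paths that no pool accepts (an assumption of this sketch). */
    static final int NO_POOL = -1;

    /** Assigns each path to the first pool with at least one accepting filter. */
    static Map<Integer, List<String>> assign(List<List<Predicate<String>>> pools,
                                             List<String> paths) {
        Map<Integer, List<String>> byPool = new TreeMap<>();
        for (String path : paths) {
            int chosen = NO_POOL;
            for (int i = 0; i < pools.size() && chosen == NO_POOL; i++) {
                for (Predicate<String> filter : pools.get(i)) {
                    if (filter.test(path)) {
                        chosen = i;
                        break;
                    }
                }
            }
            byPool.computeIfAbsent(chosen, k -> new ArrayList<>()).add(path);
        }
        return byPool;
    }

    public static void main(String[] args) {
        List<Predicate<String>> logsAndText =
            Arrays.asList(p -> p.endsWith(".log"), p -> p.endsWith(".txt"));
        List<Predicate<String>> sequenceDir =
            Arrays.asList(p -> p.startsWith("/data/seq/"));
        List<String> paths =
            Arrays.asList("/in/a.log", "/in/b.txt", "/data/seq/part-0", "/in/c.bin");
        System.out.println(assign(Arrays.asList(logsAndText, sequenceDir), paths));
    }
}
```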
getRecordReader
public abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
This is not implemented yet.
- Specified by:
getRecordReader in interface InputFormat<K,V>
- Parameters:
split - the InputSplit
job - the job that this split belongs to
- Returns:
a RecordReader
- Throws:
IOException
-
createRecordReader
public RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException
Description copied from class: CombineFileInputFormat
This is not implemented yet.
- Specified by:
createRecordReader in class CombineFileInputFormat<K,V>
- Parameters:
split - the split to be read
context - the information about the task
- Returns:
a new record reader
- Throws:
IOException
-
listStatus
List input directories. Subclasses may override to, e.g., select only files matching a regular expression.
- Parameters:
job - the job to list input paths for
- Returns:
array of FileStatus objects
- Throws:
IOException - if zero items.
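The kind of selection an overriding listStatus might perform can be sketched without any Hadoop types. The `RegexListingSketch` class below is hypothetical and works on plain path strings rather than FileStatus objects; matching the regular expression against the final path component only is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the filtering step of an overridden listStatus. */
final class RegexListingSketch {

    /** Keeps the paths whose file name (text after the last '/') fully matches regex. */
    static List<String> select(List<String> paths, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (pattern.matcher(name).matches()) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing =
            Arrays.asList("/in/part-00000", "/in/_SUCCESS", "/in/part-00001.tmp");
        System.out.println(select(listing, "part-\\d+"));
    }
}
```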
-
isSplitable
Subclasses should avoid overriding this method and should instead only override isSplitable(FileSystem, Path). The implementation of this method simply calls the other method to preserve compatibility.
- Overrides:
isSplitable in class CombineFileInputFormat<K,V>
- Parameters:
context - the job context
file - the file name to check
- Returns:
is this file splitable?
- See Also:
-
isSplitable
-
CombineFileInputFormat.createPool(List).