@InterfaceAudience.Public @InterfaceStability.Stable public interface InputFormat<K,V>
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
- Split the input files into logical InputSplits, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.
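
As a rough illustration of the bounds described above, the following sketch clamps a per-split goal between a minimum size and the block size. The class name SplitSizeSketch, the method name computeSplitSize, and the example sizes are assumptions made for this example and are not taken from this page. The goal size is assumed to be the total input size divided by the numSplits hint.

```java
// Illustrative sketch only; names and sizes are hypothetical.
final class SplitSizeSketch {

    /**
     * Clamp the per-split goal so that it does not exceed the block size
     * (upper bound) and does not fall below the configured minimum (lower bound).
     */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 1024L * 1024 * 1024;   // 1 GiB of input (assumed)
        int numSplits = 4;                      // the desired number of splits, a hint
        long goalSize = totalSize / numSplits;  // 256 MiB per split if the hint were followed
        long blockSize = 128L * 1024 * 1024;    // 128 MiB block size (assumed)
        long minSize = 1L;                      // assumed value of the minimum split size

        // The block size caps the goal, so this prints 134217728 (128 MiB).
        System.out.println(computeSplitSize(goalSize, minSize, blockSize));
    }
}
```

Note that in this sketch a minimum larger than the block size takes precedence, because the lower bound is applied last.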
Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application must also implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
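
To make the record-boundary problem concrete, here is a minimal, self-contained sketch of a reader for newline-terminated records over an in-memory byte array. It does not use the Hadoop RecordReader interface; the class BoundaryAwareLineReader, its method, and the ownership rule (a record belongs to the split that contains its first byte) are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not a Hadoop RecordReader.
final class BoundaryAwareLineReader {

    /**
     * Returns the records owned by the byte range [start, start + length).
     * Assumes 0 <= start <= data.length and records separated by '\n'.
     */
    static List<String> readRecords(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;

        // A split that begins mid-record skips ahead to the first byte after
        // the next newline; the reader of the previous split owns that record.
        if (start > 0) {
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }

        List<String> records = new ArrayList<>();
        // Read every record that starts inside the split, even if it ends
        // beyond the split's last byte.
        while (pos < end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Three byte ranges that cut through the middle of records.
        System.out.println(readRecords(data, 0, 8));
        System.out.println(readRecords(data, 8, 8));
        System.out.println(readRecords(data, 16, 4));
    }
}
```

With this rule, every record is read exactly once across the three ranges, although none of the range boundaries coincides with a record boundary.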
See Also: InputSplit, RecordReader, JobClient, FileInputFormat
Modifier and Type | Method and Description |
---|---|
RecordReader<K,V> | getRecordReader(InputSplit split, JobConf job, Reporter reporter): Get the RecordReader for the given InputSplit. |
InputSplit[] | getSplits(JobConf job, int numSplits): Logically split the set of input files for the job. |
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Logically split the set of input files for the job. Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple.
Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
Returns:
an array of InputSplits for the job.
Throws:
IOException
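
The idea of a logical split can be sketched without any Hadoop classes: the example below describes a file as a list of byte ranges and never touches the file's contents. The Range class, the split method, the path /data/in.txt, and the choice to model the third tuple field as a length in bytes are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Range is a hypothetical stand-in for a logical split.
final class LogicalSplitSketch {

    /** A byte range of a file. No data is copied or moved. */
    static final class Range {
        final String path;
        final long start;
        final long length;

        Range(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    /** Describe a file of fileLength bytes as consecutive ranges of at most splitSize bytes. */
    static List<Range> split(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            ranges.add(new Range(path, start, Math.min(splitSize, fileLength - start)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 250-byte file with a 100-byte split size yields three ranges.
        for (Range r : split("/data/in.txt", 250L, 100L)) {
            System.out.println(r);
        }
    }
}
```

Because each range only records where to read, the same file can be processed by several tasks in parallel without being rewritten.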
RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Get the RecordReader for the given InputSplit.
It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.
Parameters:
split - the InputSplit
job - the job that this split belongs to
Returns:
a RecordReader
Throws:
IOException
Copyright © 2014 Apache Software Foundation. All Rights Reserved.