Class InputFormat<K,V>
- Direct Known Subclasses:
ComposableInputFormat, CompositeInputFormat, DBInputFormat, FileInputFormat
InputFormat describes the input-specification for a
Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the
job to:
- Validate the input-specification of the job.
- Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically
sub-classes of FileInputFormat, is to split the
input into logical InputSplits based on the total size, in
bytes, of the input files. However, the FileSystem blocksize of
the input files is treated as an upper bound for input splits. A lower bound
on the split size can be set via
mapreduce.input.fileinputformat.split.minsize.
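The bounds described above can be illustrated with a minimal sketch, assuming the split size is chosen by clamping the block size between a minimum and a maximum. The class name SplitSizing and the maxSize parameter are illustrative and do not appear on this page, and the 128 MiB block size is an assumed value. In this sketch a configured minimum larger than the block size takes precedence over the block size.

```java
// Illustrative sketch only; SplitSizing is a hypothetical helper, not a Hadoop class.
final class SplitSizing {

    /** Clamp the block size between the minimum and maximum split sizes. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size of 128 MiB

        // With a tiny minimum, the block size is the effective split size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Raising the minimum above the block size produces larger splits.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```

Raising the lower bound is one way to reduce the number of splits, and therefore the number of map tasks, for very large inputs.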
Clearly, logical splits based on input size are insufficient for many
applications, since record boundaries must be respected. In such cases, the
application must also implement a RecordReader, which is responsible for
respecting record boundaries and presenting a record-oriented
view of the logical InputSplit to the individual task.
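As a rough illustration of respecting record boundaries, the sketch below reads newline-delimited records from a byte range of an in-memory buffer. It assumes one common convention: a reader whose range does not start at offset 0 discards the partial first line, and every reader finishes the line that straddles the end of its range. The class LineSplitReader and its methods are hypothetical and use no Hadoop types.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not a Hadoop RecordReader implementation.
final class LineSplitReader {

    /** Return the complete lines owned by the byte range [start, start + length). */
    static List<String> readLines(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;
        if (start != 0) {
            // The previous range is responsible for the line that crosses into this one.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that begins at or before 'end' belongs to this range, even if it ends later.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the next '\n' at or after pos, or data.length if there is none. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }
}
```

With this convention, reading consecutive fixed-size ranges of the same buffer yields every line exactly once, even though the range boundaries fall in the middle of lines.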
-
Constructor Summary
Constructors
InputFormat()
-
Method Summary
Modifier and Type | Method | Description
abstract RecordReader<K,V> | createRecordReader(InputSplit split, TaskAttemptContext context) | Create a record reader for a given split.
abstract List<InputSplit> | getSplits(JobContext context) | Logically split the set of input files for the job.
-
Constructor Details
-
InputFormat
public InputFormat()
-
-
Method Details
-
getSplits
public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException
Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.
- Parameters:
context - job configuration.
- Returns:
a list of InputSplits for the job.
- Throws:
IOException
InterruptedException
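To make the idea of a logical split concrete, the following sketch divides a file of known length into byte ranges without touching the file itself. It is a simplified illustration, not the algorithm of any particular InputFormat: the LogicalSplits class, its Range type, and the path /data/input.txt are hypothetical, and each range records a start and a length.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Range is a hypothetical stand-in for an InputSplit.
final class LogicalSplits {

    /** Describes a byte range of a file; it holds no file data. */
    static final class Range {
        final String path;
        final long start;
        final long length;

        Range(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Cover [0, fileLength) with consecutive ranges of at most splitSize bytes. */
    static List<Range> split(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The final range may be shorter than splitSize.
            ranges.add(new Range(path, start, Math.min(splitSize, fileLength - start)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : split("/data/input.txt", 250, 100)) { // hypothetical path and sizes
            System.out.println(r.path + " start=" + r.start + " length=" + r.length);
        }
    }
}
```

Because only ranges are produced, computing splits is cheap; the bytes are read later by the record reader assigned to each range.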
-
createRecordReader
public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
- Parameters:
split - the split to be read
context - the information about the task
- Returns:
a new record reader
- Throws:
IOException
InterruptedException
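The calling order described for createRecordReader can be sketched with simplified stand-in types. SketchReader below is not the Hadoop RecordReader class; it borrows only the method names initialize, nextKeyValue, and getCurrentValue, and takes a plain list in place of an InputSplit and TaskAttemptContext. The runTask method plays the role of the framework under these assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; these types are hypothetical stand-ins, not Hadoop classes.
final class ReaderLifecycle {

    /** Simplified reader that refuses to produce records before initialization. */
    static final class SketchReader {
        private Iterator<String> records; // stays null until initialize() is called
        private String current;

        void initialize(List<String> splitContents) {
            records = splitContents.iterator();
        }

        boolean nextKeyValue() {
            if (records == null) {
                throw new IllegalStateException("initialize() was not called");
            }
            if (!records.hasNext()) {
                return false;
            }
            current = records.next();
            return true;
        }

        String getCurrentValue() {
            return current;
        }
    }

    /** Create the reader, initialize it, then iterate over its records. */
    static List<String> runTask(List<String> splitContents) {
        SketchReader reader = new SketchReader(); // stands in for createRecordReader(...)
        reader.initialize(splitContents);         // always happens before the first read
        List<String> seen = new ArrayList<>();
        while (reader.nextKeyValue()) {
            seen.add(reader.getCurrentValue());
        }
        return seen;
    }
}
```

Keeping construction separate from initialization means an implementation of createRecordReader can simply return a new reader and defer opening the underlying input to initialize.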
-