Interface InputFormat<K,V>
- All Known Subinterfaces:
ComposableInputFormat<K,V>
- All Known Implementing Classes:
CombineFileInputFormat, CombineSequenceFileInputFormat, CombineTextInputFormat, CompositeInputFormat, DBInputFormat, FileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat, MultiFileInputFormat, NLineInputFormat, Parser.Node, SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter, SequenceFileInputFormat, TextInputFormat
InputFormat describes the input-specification for a
Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the
job to:
- Validate the input-specification of the job.
- Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
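To make this contract concrete, the following is a self-contained, hypothetical pure-Java model of the three responsibilities (toy types and method names chosen for illustration, not the actual Hadoop classes):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified mirror of the InputFormat contract (toy types,
// not Hadoop's). Records are plain strings; a "split" is a sublist of them.
public class InputFormatSketch {

    // Mimics getSplits: partition the input records into numSplits logical groups.
    static List<List<String>> getSplits(List<String> lines, int numSplits) {
        int chunk = Math.max(1, (lines.size() + numSplits - 1) / numSplits);
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunk) {
            splits.add(lines.subList(i, Math.min(i + chunk, lines.size())));
        }
        return splits;
    }

    // Mimics getRecordReader: a record-oriented view over one logical split.
    static Iterator<String> getRecordReader(List<String> split) {
        return split.iterator();
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d");
        for (List<String> split : getSplits(input, 2)) { // each split goes to one "Mapper"
            Iterator<String> reader = getRecordReader(split);
            while (reader.hasNext()) {
                System.out.println(reader.next()); // prints a, b, c, d, one per line
            }
        }
    }
}
```

In the real framework the split carries only metadata (file path and byte range) rather than the records themselves; the RecordReader is what turns that range into key/value pairs.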
The default behavior of file-based InputFormats, typically
sub-classes of FileInputFormat, is to split the
input into logical InputSplits based on the total size, in
bytes, of the input files. However, the FileSystem blocksize of
the input files is treated as an upper bound for input splits. A lower bound
on the split size can be set via
mapreduce.input.fileinputformat.split.minsize.
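One way to picture these bounds is as a clamp: the goal size (total input bytes divided by the requested number of splits) is capped by the block size and floored by the configured minimum. The sketch below is a standalone illustration of that rule, not the Hadoop method itself:

```java
// Standalone sketch of the split-size bounds described above:
// the FileSystem block size is an upper bound, the configured
// minimum (mapreduce.input.fileinputformat.split.minsize) a lower bound.
public class SplitSizeSketch {

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;     // e.g. 128 MB HDFS block
        long minSize   = 1L;                     // split.minsize default-ish floor
        long goalSize  = 512L * 1024 * 1024 / 3; // totalSize / numSplits hint

        // goalSize (~170 MB) exceeds the block size, so the block size wins.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize));
    }
}
```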
Clearly, logical splits based on input size are insufficient for many
applications, since record boundaries must be respected. In such cases, the
application also has to implement a RecordReader, which bears the
responsibility of respecting record boundaries and presenting a record-oriented
view of the logical InputSplit to the individual task.
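As a concrete illustration, here is a self-contained toy reader (hypothetical, not Hadoop's line reader, though it follows the same common strategy) that restores line boundaries over arbitrary byte-range splits: each split skips its leading partial line, which the previous split owns, and reads one line past its end to finish the record that crosses the boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy record-oriented view over a byte-range split of newline-delimited data.
// Strategy: a split owns every record that *starts* inside [start, end).
public class BoundarySketch {

    static List<String> readRecords(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Skip the partial record at the front; the previous split owns it.
        // (Searching from start - 1 means a record starting exactly at
        // 'start', i.e. preceded by '\n', is kept.)
        if (start > 0) {
            int nl = data.indexOf('\n', start - 1);
            if (nl < 0) return records;
            pos = nl + 1;
        }
        // Emit whole records, reading past 'end' to finish the last one.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) {
                records.add(data.substring(pos));
                break;
            }
            records.add(data.substring(pos, nl));
            pos = nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "aa\nbbbb\ncc"; // split mid-record at byte 5
        System.out.println(readRecords(data, 0, 5));  // [aa, bbbb]
        System.out.println(readRecords(data, 5, 10)); // [cc]
    }
}
```

Together, the two splits yield every record exactly once, even though the byte boundary falls in the middle of "bbbb".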
Method Summary
Modifier and Type | Method | Description
RecordReader<K,V> | getRecordReader(InputSplit split, JobConf job, Reporter reporter) | Get the RecordReader for the given InputSplit.
InputSplit[] | getSplits(JobConf job, int numSplits) | Logically split the set of input files for the job.
-
Method Details
-
getSplits
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Logically split the set of input files for the job. Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be a <input-file-path, start, offset> tuple.
- Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
- Returns:
an array of InputSplits for the job.
- Throws:
IOException
-
getRecordReader
RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Get the RecordReader for the given InputSplit.
It is the responsibility of the RecordReader to respect record boundaries while processing the logical split, so as to present a record-oriented view to the individual task.
- Parameters:
split - the InputSplit
job - the job that this split belongs to
- Returns:
a RecordReader
- Throws:
IOException