Class NLineInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
NLineInputFormat splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (a setup referred to as a "parameter sweep"). One way to achieve this is to specify a set of parameters, one set per line, as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications: it splits the input file such that, by default, one line is fed as the value to one map task, and the key is the offset; i.e. (k,v) is (LongWritable, Text). The location hints will span the whole mapred cluster.
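As a sketch of how the parameter-sweep pattern described above might be wired up (the paths, the `sweep.dataset.path` config key, and the `SweepMapper` class are hypothetical illustrations, not part of this API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParameterSweepDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical key: where each mapper finds the actual input dataset.
    conf.set("sweep.dataset.path", "/data/dataset");

    Job job = Job.getInstance(conf, "parameter sweep");
    job.setJarByClass(ParameterSweepDriver.class);

    // The job's input path is the control file: one parameter set per line.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path("/control/params.txt"));

    // Feed one line (one parameter set) to each map task; raise this value
    // to batch several parameter sets per mapper.
    NLineInputFormat.setNumLinesPerSplit(job, 1);

    // job.setMapperClass(SweepMapper.class); // hypothetical Mapper<LongWritable, Text, ?, ?>
    FileOutputFormat.setOutputPath(job, new Path("/sweep/out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map task then receives its parameter line as a (LongWritable offset, Text line) pair.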
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
FileInputFormat.Counter
Field Summary
Fields
static final String LINES_PER_MAP
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Constructor Summary
Constructors
NLineInputFormat()
Method Summary
Modifier and Type / Method / Description
protected static FileSplit
    createFileSplit(Path fileName, long begin, long length)
    NLineInputFormat uses LineRecordReader, which always reads (and consumes) at least one character out of its upper split boundary.
RecordReader<LongWritable,Text>
    createRecordReader(InputSplit genericSplit, TaskAttemptContext context)
    Create a record reader for a given split.
static int
    getNumLinesPerSplit(JobContext job)
    Get the number of lines per split.
List<InputSplit>
    getSplits(JobContext job)
    Logically splits the set of input files for the job, splits N lines of the input as one split.
static List<FileSplit>
    getSplitsForFile(FileStatus status, Configuration conf, int numLinesPerSplit)
static void
    setNumLinesPerSplit(Job job, int numLines)
    Set the number of lines per split.
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus
Field Details
LINES_PER_MAP
public static final String LINES_PER_MAP
See Also:
    Constant Field Values
Constructor Details
NLineInputFormat
public NLineInputFormat()
Method Details
createRecordReader
public RecordReader<LongWritable,Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException
Description copied from class: InputFormat
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
Specified by:
    createRecordReader in class InputFormat<LongWritable,Text>
Parameters:
    genericSplit - the split to be read
    context - the information about the task
Returns:
    a new record reader
Throws:
    IOException
getSplits
public List<InputSplit> getSplits(JobContext job) throws IOException
Logically splits the set of input files for the job, splits N lines of the input as one split.
Overrides:
    getSplits in class FileInputFormat<LongWritable,Text>
Parameters:
    job - the job context
Returns:
    the InputSplits for the job
Throws:
    IOException
getSplitsForFile
public static List<FileSplit> getSplitsForFile(FileStatus status, Configuration conf, int numLinesPerSplit) throws IOException
Throws:
    IOException
createFileSplit
protected static FileSplit createFileSplit(Path fileName, long begin, long length)
NLineInputFormat uses LineRecordReader, which always reads (and consumes) at least one character out of its upper split boundary. So, to make sure that each mapper gets N lines, we move back the upper split limits of each split by one character here.
Parameters:
    fileName - Path of file
    begin - the position of the first byte in the file to process
    length - number of bytes in InputSplit
Returns:
    FileSplit
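The effect of that one-character shift can be sketched with plain arithmetic (an illustration of the documented behavior, not the actual Hadoop implementation):

```java
// Sketch: adjust a nominal N-line chunk [begin, begin + length) so that
// LineRecordReader's one-character overshoot keeps each mapper on exactly
// its own N lines (illustrative only; assumptions noted in comments).
public final class SplitBoundarySketch {
    // Returns {adjustedStart, adjustedLength} for the chunk.
    static long[] adjust(long begin, long length) {
        if (begin == 0) {
            // First split: trim one character off the end, since the reader
            // will consume one character past its upper boundary anyway.
            return new long[] { 0, length - 1 };
        }
        // Later splits: start one character earlier; a reader starting at a
        // nonzero offset skips up to the first newline, so it begins on the
        // intended line, and the upper limit also moves back one character.
        return new long[] { begin - 1, length };
    }

    public static void main(String[] args) {
        // A file with two nominal 10-character chunks of N lines each:
        long[] first = adjust(0, 10);
        long[] second = adjust(10, 10);
        System.out.println(first[0] + "," + first[1]);   // 0,9
        System.out.println(second[0] + "," + second[1]); // 9,10
    }
}
```

Note how every split's upper limit (start + length) ends up one character earlier than the nominal chunk boundary, matching the description above.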
setNumLinesPerSplit
public static void setNumLinesPerSplit(Job job, int numLines)
Set the number of lines per split.
Parameters:
    job - the job to modify
    numLines - the number of lines per split
getNumLinesPerSplit
public static int getNumLinesPerSplit(JobContext job)
Get the number of lines per split.
Parameters:
    job - the job
Returns:
    the number of lines per split