org.apache.hadoop.mapreduce.lib.input
Class NLineInputFormat
java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<K,V>
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
      org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
@InterfaceAudience.Public
@InterfaceStability.Stable
public class NLineInputFormat
- extends FileInputFormat<LongWritable,Text>
NLineInputFormat splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper
processes the same input file(s), but the computation is
controlled by different parameters (referred to as "parameter sweeps").
One way to achieve this is to specify a set of parameters
(one set per line) as input in a control file
(which is the input path to the map-reduce application,
whereas the input dataset is specified
via a config variable in JobConf).
NLineInputFormat can be used in such applications: it splits
the input file so that, by default, one line is fed as
the value to one map task, and the key is the byte offset of the line,
i.e. (k,v) is (LongWritable, Text).
The location hints will span the whole mapred cluster.
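A minimal driver sketch is shown below. It is not part of this API's documentation: the driver class (SweepDriver), the mapper class (ParameterSweepMapper, sketched under createRecordReader below), and the paths are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SweepDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "parameter sweep");     // Job.getInstance(conf, ...) in newer releases
        job.setJarByClass(SweepDriver.class);
        job.setMapperClass(ParameterSweepMapper.class); // hypothetical mapper, sketched below
        job.setNumReduceTasks(0);                       // map-only sweep; adjust as needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(NLineInputFormat.class);
        // Control file: one parameter set per line (illustrative path).
        NLineInputFormat.addInputPath(job, new Path("/user/me/params.txt"));
        // Each map task receives 10 lines of the control file.
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        FileOutputFormat.setOutputPath(job, new Path("/user/me/sweep-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }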
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LINES_PER_MAP
public static final String LINES_PER_MAP
Configuration key controlling the number of input lines per split.
- See Also:
- Constant Field Values
NLineInputFormat
public NLineInputFormat()
createRecordReader
public RecordReader<LongWritable,Text> createRecordReader(InputSplit genericSplit,
TaskAttemptContext context)
throws IOException
- Description copied from class: InputFormat
- Create a record reader for a given split. The framework will call
RecordReader.initialize(InputSplit, TaskAttemptContext) before
the split is used.
- Specified by:
createRecordReader in class InputFormat<LongWritable,Text>
- Parameters:
genericSplit - the split to be read
context - the information about the task
- Returns:
- a new record reader
- Throws:
IOException
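Because NLineInputFormat uses a line-oriented record reader, a mapper for this format receives the byte offset as the key and one line of the control file as the value. A sketch of such a mapper follows; the class name and output types are illustrative, not part of this API.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a mapper fed by NLineInputFormat. The key is the byte offset
    // of the line; the value is one line of the control file (one parameter set).
    public class ParameterSweepMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Run the computation controlled by this parameter set; here we
        // simply echo the line keyed by its offset.
        context.write(new Text(offset.toString()), line);
      }
    }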
getSplits
public List<InputSplit> getSplits(JobContext job)
throws IOException
- Logically splits the set of input files for the job, so that N lines
of the input form one split.
- Overrides:
getSplits in class FileInputFormat<LongWritable,Text>
- Parameters:
job - job configuration.
- Returns:
- the list of InputSplits for the job.
- Throws:
IOException
- See Also:
FileInputFormat.getSplits(JobContext)
getSplitsForFile
public static List<FileSplit> getSplitsForFile(FileStatus status,
Configuration conf,
int numLinesPerSplit)
throws IOException
- Throws:
IOException
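Since this helper is public and static, it can also be called directly, for example to inspect how a control file would be divided. The configuration, path, and line count below are illustrative only.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    // Inside a method that may throw IOException:
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/user/me/params.txt")); // illustrative path
    // Divide the control file into splits of 5 lines each.
    List<FileSplit> splits = NLineInputFormat.getSplitsForFile(status, conf, 5);
    System.out.println("Number of splits: " + splits.size());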
createFileSplit
protected static FileSplit createFileSplit(Path fileName,
long begin,
long length)
- NLineInputFormat uses LineRecordReader, which always reads
(and consumes) at least one character out of its upper split
boundary. So to make sure that each mapper gets N lines, we
move back the upper split limits of each split
by one character here.
- Parameters:
fileName - Path of file
begin - the position of the first byte in the file to process
length - number of bytes in InputSplit
- Returns:
- FileSplit
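The adjustment can be pictured with the following sketch. It illustrates the idea described above and is not guaranteed to match the exact source of this method.

    // Sketch only: shift the split start back by one character for every
    // split after the first, so that the LineRecordReader consuming one
    // extra character at the boundary still yields exactly N lines.
    protected static FileSplit createFileSplit(Path fileName, long begin, long length) {
      return (begin == 0)
          ? new FileSplit(fileName, begin, length - 1, new String[] {})
          : new FileSplit(fileName, begin - 1, length, new String[] {});
    }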
setNumLinesPerSplit
public static void setNumLinesPerSplit(Job job,
int numLines)
- Set the number of lines per split
- Parameters:
job - the job to modify
numLines - the number of lines per split
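Assuming LINES_PER_MAP is the configuration key backing this setter (an assumption, not stated on this page), the same effect could be obtained by setting the value directly on the job configuration:

    // Assumes LINES_PER_MAP is the key read back by getNumLinesPerSplit.
    job.getConfiguration().setInt(NLineInputFormat.LINES_PER_MAP, 10);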
getNumLinesPerSplit
public static int getNumLinesPerSplit(JobContext job)
- Get the number of lines per split
- Parameters:
job - the job
- Returns:
- the number of lines per split
Copyright © 2009 The Apache Software Foundation