Class NLineInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
NLineInputFormat splits N lines of input as one split.
In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters (a setup referred to as a "parameter sweep"). One way to achieve this is to specify a set of parameters, one set per line, as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf). NLineInputFormat can be used in such applications: it splits the input file such that, by default, one line is fed as the value to one map task, and the key is the offset; i.e. (k,v) is (LongWritable, Text). The location hints will span the whole mapred cluster.
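As a sketch of how the parameter-sweep pattern described above might be wired up (the paths, the `sweep.dataset.path` config key, and the `SweepMapper` class are hypothetical illustrations, not part of this API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParameterSweepDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical key: where each mapper finds the actual input dataset.
    conf.set("sweep.dataset.path", "/data/dataset");

    Job job = Job.getInstance(conf, "parameter sweep");
    job.setJarByClass(ParameterSweepDriver.class);

    // The job's input path is the control file: one parameter set per line.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path("/control/params.txt"));

    // Feed one line (one parameter set) to each map task; raise this value
    // to batch several parameter sets per mapper.
    NLineInputFormat.setNumLinesPerSplit(job, 1);

    // job.setMapperClass(SweepMapper.class); // hypothetical Mapper<LongWritable, Text, ?, ?>
    FileOutputFormat.setOutputPath(job, new Path("/sweep/out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map task then receives its parameter line as a (LongWritable offset, Text line) pair.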
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
FileInputFormat.Counter
Field Summary
Fields
static final String LINES_PER_MAP
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Constructor Summary
Constructors
NLineInputFormat()
Method Summary
Modifier and Type / Method / Description
protected static FileSplit
    createFileSplit(Path fileName, long begin, long length)
    NLineInputFormat uses LineRecordReader, which always reads (and consumes) at least one character out of its upper split boundary.
RecordReader<LongWritable,Text>
    createRecordReader(InputSplit genericSplit, TaskAttemptContext context)
    Create a record reader for a given split.
static int
    getNumLinesPerSplit(JobContext job)
    Get the number of lines per split.
List<InputSplit>
    getSplits(JobContext job)
    Logically splits the set of input files for the job, splits N lines of the input as one split.
static List<FileSplit>
    getSplitsForFile(FileStatus status, Configuration conf, int numLinesPerSplit)
static void
    setNumLinesPerSplit(Job job, int numLines)
    Set the number of lines per split.
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus
Field Details
LINES_PER_MAP
public static final String LINES_PER_MAP
See Also:
    Constant Field Values
Constructor Details
NLineInputFormat
public NLineInputFormat()
Method Details
createRecordReader
public RecordReader<LongWritable,Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException
Description copied from class: InputFormat
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
Specified by:
    createRecordReader in class InputFormat<LongWritable,Text>
Parameters:
    genericSplit - the split to be read
    context - the information about the task
Returns:
    a new record reader
Throws:
    IOException
getSplits
public List<InputSplit> getSplits(JobContext job) throws IOException
Logically splits the set of input files for the job, splits N lines of the input as one split.
Overrides:
    getSplits in class FileInputFormat<LongWritable,Text>
Parameters:
    job - the job context
Returns:
    the InputSplits for the job
Throws:
    IOException
getSplitsForFile
public static List<FileSplit> getSplitsForFile(FileStatus status, Configuration conf, int numLinesPerSplit) throws IOException
Throws:
    IOException
createFileSplit
protected static FileSplit createFileSplit(Path fileName, long begin, long length)
NLineInputFormat uses LineRecordReader, which always reads (and consumes) at least one character out of its upper split boundary. So, to make sure that each mapper gets N lines, we move back the upper split limits of each split by one character here.
Parameters:
    fileName - Path of file
    begin - the position of the first byte in the file to process
    length - number of bytes in InputSplit
Returns:
    FileSplit
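The effect of that one-character shift can be sketched with plain arithmetic (an illustration of the documented behavior, not the actual Hadoop implementation):

```java
// Sketch: adjust a nominal N-line chunk [begin, begin + length) so that
// LineRecordReader's one-character overshoot keeps each mapper on exactly
// its own N lines (illustrative only; assumptions noted in comments).
public final class SplitBoundarySketch {
    // Returns {adjustedStart, adjustedLength} for the chunk.
    static long[] adjust(long begin, long length) {
        if (begin == 0) {
            // First split: trim one character off the end, since the reader
            // will consume one character past its upper boundary anyway.
            return new long[] { 0, length - 1 };
        }
        // Later splits: start one character earlier; a reader starting at a
        // nonzero offset skips up to the first newline, so it begins on the
        // intended line, and the upper limit also moves back one character.
        return new long[] { begin - 1, length };
    }

    public static void main(String[] args) {
        // A file with two nominal 10-character chunks of N lines each:
        long[] first = adjust(0, 10);
        long[] second = adjust(10, 10);
        System.out.println(first[0] + "," + first[1]);   // 0,9
        System.out.println(second[0] + "," + second[1]); // 9,10
    }
}
```

Note how every split's upper limit (start + length) ends up one character earlier than the nominal chunk boundary, matching the description above.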
setNumLinesPerSplit
public static void setNumLinesPerSplit(Job job, int numLines)
Set the number of lines per split.
Parameters:
    job - the job to modify
    numLines - the number of lines per split
getNumLinesPerSplit
public static int getNumLinesPerSplit(JobContext job)
Get the number of lines per split.
Parameters:
    job - the job
Returns:
    the number of lines per split