Package org.apache.hadoop.mapred
Class FileInputFormat<K,V>
java.lang.Object
org.apache.hadoop.mapred.FileInputFormat<K,V>
- All Implemented Interfaces:
InputFormat<K,V>
- Direct Known Subclasses:
FixedLengthInputFormat,KeyValueTextInputFormat,MultiFileInputFormat,NLineInputFormat,SequenceFileInputFormat,TextInputFormat
@Public
@Stable
public abstract class FileInputFormat<K,V>
extends Object
implements InputFormat<K,V>
A base class for file-based InputFormats.
FileInputFormat is the base class for all file-based
InputFormats. This provides a generic implementation of
getSplits(JobConf, int).
Implementations of FileInputFormat can also override the
isSplitable(FileSystem, Path) method to prevent input files
from being split up in certain situations. Implementations that may
deal with non-splittable files must override this method, since
the default implementation assumes splitting is always possible.
Nested Class Summary
Field Summary
Constructor Summary
Method Summary
static void addInputPath(JobConf conf, Path path)
    Add a Path to the list of inputs for the map-reduce job.
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter)
    Add files in the input path recursively into the results.
static void addInputPaths(JobConf conf, String commaSeparatedPaths)
    Add the given comma separated paths to the list of inputs for the map-reduce job.
protected long computeSplitSize(long goalSize, long minSize, long blockSize)
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
static PathFilter getInputPathFilter(JobConf conf)
    Get a PathFilter instance of the filter set for the input paths.
static Path[] getInputPaths(JobConf conf)
    Get the list of input Paths for the map-reduce job.
abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
    Get the RecordReader for the given InputSplit.
protected String[] getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap)
    This function identifies and returns the hosts that contribute most for a given split.
InputSplit[] getSplits(JobConf job, int numSplits)
    Splits files returned by listStatus(JobConf) when they're too big.
protected boolean isSplitable(FileSystem fs, Path filename)
    Is the given filename splittable?
protected FileStatus[] listStatus(JobConf job)
    List input directories.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
    A factory that makes the split for this class.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
    A factory that makes the split for this class.
static void setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)
    Set a PathFilter to be applied to the input paths for the map-reduce job.
static void setInputPaths(JobConf conf, String commaSeparatedPaths)
    Sets the given comma separated paths as the list of inputs for the map-reduce job.
static void setInputPaths(JobConf conf, Path... inputPaths)
    Set the array of Paths as the list of inputs for the map-reduce job.
protected void setMinSplitSize(long minSplitSize)
-
Field Details
-
LOG
public static final org.slf4j.Logger LOG -
NUM_INPUT_FILES
- See Also:
-
INPUT_DIR_RECURSIVE
- See Also:
-
INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS
- See Also:
-
-
Constructor Details
-
FileInputFormat
public FileInputFormat()
-
-
Method Details
-
setMinSplitSize
protected void setMinSplitSize(long minSplitSize)
isSplitable
protected boolean isSplitable(FileSystem fs, Path filename)
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method, and can return false to ensure that individual input files are never split up, so that Mappers process entire files.
- Parameters:
fs - the file system that the file is on
filename - the file name to check
- Returns:
is this file splittable?
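As an illustration of the kind of check an isSplitable override might perform, the sketch below treats stream-compressed files as non-splittable based on their filename suffix. The suffix list and class name are hypothetical, for illustration only; a real override would consult the configured compression codecs rather than raw file extensions.

```java
// Hypothetical helper mirroring an isSplitable-style decision: files with a
// stream-compression suffix are assumed non-splittable, everything else is
// splittable. The suffix list here is an illustrative assumption.
class SplittabilityCheck {
    private static final String[] NON_SPLITTABLE_SUFFIXES = { ".gz", ".snappy" };

    static boolean isSplittable(String filename) {
        for (String suffix : NON_SPLITTABLE_SUFFIXES) {
            if (filename.endsWith(suffix)) {
                return false; // whole file must go to a single Mapper
            }
        }
        return true;
    }
}
```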
-
getRecordReader
public abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Description copied from interface: InputFormat
Get the RecordReader for the given InputSplit. It is the responsibility of the RecordReader to respect record boundaries while processing the logical split, to present a record-oriented view to the individual task.
- Specified by:
getRecordReader in interface InputFormat<K,V>
- Parameters:
split - the InputSplit
job - the job that this split belongs to
- Returns:
a RecordReader
- Throws:
IOException
-
setInputPathFilter
public static void setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.
- Parameters:
filter - the PathFilter class used for filtering the input paths.
-
getInputPathFilter
public static PathFilter getInputPathFilter(JobConf conf)
Get a PathFilter instance of the filter set for the input paths.
- Returns:
the PathFilter instance set for the job, null if none has been set.
-
addInputPathRecursively
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
Add files in the input path recursively into the results.
- Parameters:
result - The List to store all files.
fs - The FileSystem.
path - The input path.
inputFilter - The input filter that can be used to filter files/dirs.
- Throws:
IOException
-
listStatus
protected FileStatus[] listStatus(JobConf job) throws IOException
List input directories. Subclasses may override to, e.g., select only files matching a regular expression. If security is enabled, this method collects delegation tokens from the input paths and adds them to the job's credentials.
- Parameters:
job - the job to list input paths for and attach tokens to.
- Returns:
array of FileStatus objects
- Throws:
IOException - if zero items.
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class. It can be overridden by sub-classes to make sub-types.
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class. It can be overridden by sub-classes to make sub-types.
getSplits
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Splits files returned by listStatus(JobConf) when they're too big.
- Specified by:
getSplits in interface InputFormat<K,V>
- Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
- Returns:
an array of InputSplits for the job.
- Throws:
IOException
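The chunking performed by the generic getSplits implementation can be sketched in plain Java. This is a simplification for a single file that ignores block locations and splittability; the 1.1 "slop" factor is an assumption mirroring the behavior of letting the final chunk grow slightly past the split size instead of emitting a tiny trailing split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of chopping one file of a given length into splits.
// Each split is an {offset, length} pair. The SPLIT_SLOP factor lets the
// last chunk be up to 10% larger than splitSize, avoiding a tiny final split.
class SplitSketch {
    static final double SPLIT_SLOP = 1.1;

    static List<long[]> split(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLength - bytesRemaining, splitSize });
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // remainder (possibly up to splitSize * SPLIT_SLOP bytes)
            splits.add(new long[] { fileLength - bytesRemaining, bytesRemaining });
        }
        return splits;
    }
}
```

For example, a 250-byte file with a 100-byte split size yields three splits, while a 105-byte file yields a single 105-byte split because the overshoot stays within the slop factor.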
-
computeSplitSize
protected long computeSplitSize(long goalSize, long minSize, long blockSize)
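The split-size computation clamps the goal size (roughly total input size divided by the desired number of splits) between the configured minimum split size and the block size. A plain-Java sketch of that formula, assuming the classic max/min clamping:

```java
// Sketch of the split-size formula: take the goal size, but never exceed
// the block size and never fall below the configured minimum split size.
class SplitSize {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```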
getBlockIndex
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
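getBlockIndex locates the block that contains a given file offset. The lookup can be sketched in plain Java, using simple {offset, length} pairs in place of BlockLocation objects; returning -1 for an out-of-range offset is an illustrative choice, not necessarily the real method's behavior.

```java
// Sketch of the offset-to-block lookup. Each block is an {offset, length}
// pair. Returns the index of the block containing the given file offset,
// or -1 (an assumption for this sketch) if the offset is past the last block.
class BlockIndex {
    static int getBlockIndex(long[][] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            long start = blocks[i][0];
            long end = start + blocks[i][1];
            if (offset >= start && offset < end) {
                return i;
            }
        }
        return -1; // offset outside the file
    }
}
```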
-
setInputPaths
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)
Sets the given comma separated paths as the list of inputs for the map-reduce job.
- Parameters:
conf - Configuration of the job
commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
-
addInputPaths
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
Add the given comma separated paths to the list of inputs for the map-reduce job.
- Parameters:
conf - The configuration of the job
commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
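Conceptually, the comma-separated overloads turn a string like "in1,in2,in3" into individual input paths. A minimal sketch of that parsing (the class name is hypothetical, and escaping of commas inside path names is deliberately omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of what the comma-separated set/add overloads conceptually
// do: break "dir1,dir2,file1" into individual input paths. Path names
// containing escaped commas are not handled in this sketch.
class InputPathParsing {
    static List<String> parse(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        for (String p : commaSeparatedPaths.split(",")) {
            if (!p.isEmpty()) {
                paths.add(p);
            }
        }
        return paths;
    }
}
```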
-
setInputPaths
public static void setInputPaths(JobConf conf, Path... inputPaths)
Set the array of Paths as the list of inputs for the map-reduce job.
- Parameters:
conf - Configuration of the job.
inputPaths - the Paths of the input directories/files for the map-reduce job.
-
addInputPath
public static void addInputPath(JobConf conf, Path path)
Add a Path to the list of inputs for the map-reduce job.
- Parameters:
conf - The configuration of the job
path - Path to be added to the list of inputs for the map-reduce job.
-
getInputPaths
public static Path[] getInputPaths(JobConf conf)
Get the list of input Paths for the map-reduce job.
- Parameters:
conf - The configuration of the job
- Returns:
the list of input Paths for the map-reduce job.
-
getSplitHosts
protected String[] getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap) throws IOException
This function identifies and returns the hosts that contribute most for a given split. For calculating the contribution, rack locality is treated on par with host locality, so hosts from racks that contribute the most are preferred over hosts on racks that contribute less.
- Parameters:
blkLocations - The list of block locations
offset -
splitSize -
- Returns:
an array of hosts that contribute most to this split
- Throws:
IOException
-