Package org.apache.hadoop.mapred
Class FileInputFormat<K,V>
java.lang.Object
org.apache.hadoop.mapred.FileInputFormat<K,V>
- All Implemented Interfaces:
InputFormat<K,V>
- Direct Known Subclasses:
FixedLengthInputFormat,KeyValueTextInputFormat,MultiFileInputFormat,NLineInputFormat,SequenceFileInputFormat,TextInputFormat
@Public
@Stable
public abstract class FileInputFormat<K,V>
extends Object
implements InputFormat<K,V>
A base class for file-based InputFormats.
FileInputFormat is the base class for all file-based
InputFormats. This provides a generic implementation of
getSplits(JobConf, int).
Implementations of FileInputFormat can also override the
isSplitable(FileSystem, Path) method to prevent input files
from being split up in certain situations. Implementations that may
deal with non-splittable files must override this method, since
the default implementation assumes splitting is always possible.
Nested Class Summary
Field Summary
Constructor Summary
Method Summary
static void addInputPath(JobConf conf, Path path)
    Add a Path to the list of inputs for the map-reduce job.
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter)
    Add files in the input path recursively into the results.
static void addInputPaths(JobConf conf, String commaSeparatedPaths)
    Add the given comma separated paths to the list of inputs for the map-reduce job.
protected long computeSplitSize(long goalSize, long minSize, long blockSize)
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
static PathFilter getInputPathFilter(JobConf conf)
    Get a PathFilter instance of the filter set for the input paths.
static Path[] getInputPaths(JobConf conf)
    Get the list of input Paths for the map-reduce job.
abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
    Get the RecordReader for the given InputSplit.
protected String[] getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap)
    This function identifies and returns the hosts that contribute most for a given split.
InputSplit[] getSplits(JobConf job, int numSplits)
    Splits files returned by listStatus(JobConf) when they're too big.
protected boolean isSplitable(FileSystem fs, Path filename)
    Is the given filename splittable?
protected FileStatus[] listStatus(JobConf job)
    List input directories.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
    A factory that makes the split for this class.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
    A factory that makes the split for this class.
static void setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)
    Set a PathFilter to be applied to the input paths for the map-reduce job.
static void setInputPaths(JobConf conf, String commaSeparatedPaths)
    Sets the given comma separated paths as the list of inputs for the map-reduce job.
static void setInputPaths(JobConf conf, Path... inputPaths)
    Set the array of Paths as the list of inputs for the map-reduce job.
protected void setMinSplitSize(long minSplitSize)
-
Field Details
-
LOG
public static final org.slf4j.Logger LOG -
NUM_INPUT_FILES
- See Also:
-
INPUT_DIR_RECURSIVE
- See Also:
-
INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS
- See Also:
-
-
Constructor Details
-
FileInputFormat
public FileInputFormat()
-
-
Method Details
-
setMinSplitSize
protected void setMinSplitSize(long minSplitSize)
isSplitable
protected boolean isSplitable(FileSystem fs, Path filename)
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method, and can return false to ensure that individual input files are never split up, so that Mappers process entire files.
- Parameters:
fs - the file system that the file is on
filename - the file name to check
- Returns:
is this file splittable?
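As an illustration of the kind of check an isSplitable override might perform, the sketch below treats stream-compressed files as non-splittable based on their filename suffix. The suffix list and class name are hypothetical, for illustration only; a real override would consult the configured compression codecs rather than raw file extensions.

```java
// Hypothetical helper mirroring an isSplitable-style decision: files with a
// stream-compression suffix are assumed non-splittable, everything else is
// splittable. The suffix list here is an illustrative assumption.
class SplittabilityCheck {
    private static final String[] NON_SPLITTABLE_SUFFIXES = { ".gz", ".snappy" };

    static boolean isSplittable(String filename) {
        for (String suffix : NON_SPLITTABLE_SUFFIXES) {
            if (filename.endsWith(suffix)) {
                return false; // whole file must go to a single Mapper
            }
        }
        return true;
    }
}
```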
-
getRecordReader
public abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Description copied from interface: InputFormat
Get the RecordReader for the given InputSplit. It is the responsibility of the RecordReader to respect record boundaries while processing the logical split, to present a record-oriented view to the individual task.
- Specified by:
getRecordReader in interface InputFormat<K,V>
- Parameters:
split - the InputSplit
job - the job that this split belongs to
- Returns:
a RecordReader
- Throws:
IOException
-
setInputPathFilter
public static void setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.
- Parameters:
filter - the PathFilter class used for filtering the input paths.
-
getInputPathFilter
public static PathFilter getInputPathFilter(JobConf conf)
Get a PathFilter instance of the filter set for the input paths.
- Returns:
the PathFilter instance set for the job, null if none has been set.
-
addInputPathRecursively
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
Add files in the input path recursively into the results.
- Parameters:
result - The List to store all files.
fs - The FileSystem.
path - The input path.
inputFilter - The input filter that can be used to filter files/dirs.
- Throws:
IOException
-
listStatus
protected FileStatus[] listStatus(JobConf job) throws IOException
List input directories. Subclasses may override to, e.g., select only files matching a regular expression. If security is enabled, this method collects delegation tokens from the input paths and adds them to the job's credentials.
- Parameters:
job - the job to list input paths for and attach tokens to.
- Returns:
array of FileStatus objects
- Throws:
IOException - if zero items.
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class. It can be overridden by sub-classes to make sub-types.
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class. It can be overridden by sub-classes to make sub-types.
getSplits
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Splits files returned by listStatus(JobConf) when they're too big.
- Specified by:
getSplits in interface InputFormat<K,V>
- Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
- Returns:
an array of InputSplits for the job.
- Throws:
IOException
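The chunking performed by the generic getSplits implementation can be sketched in plain Java. This is a simplification for a single file that ignores block locations and splittability; the 1.1 "slop" factor is an assumption mirroring the behavior of letting the final chunk grow slightly past the split size instead of emitting a tiny trailing split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of chopping one file of a given length into splits.
// Each split is an {offset, length} pair. The SPLIT_SLOP factor lets the
// last chunk be up to 10% larger than splitSize, avoiding a tiny final split.
class SplitSketch {
    static final double SPLIT_SLOP = 1.1;

    static List<long[]> split(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLength - bytesRemaining, splitSize });
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // remainder (possibly up to splitSize * SPLIT_SLOP bytes)
            splits.add(new long[] { fileLength - bytesRemaining, bytesRemaining });
        }
        return splits;
    }
}
```

For example, a 250-byte file with a 100-byte split size yields three splits, while a 105-byte file yields a single 105-byte split because the overshoot stays within the slop factor.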
-
computeSplitSize
protected long computeSplitSize(long goalSize, long minSize, long blockSize)
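The split-size computation clamps the goal size (roughly total input size divided by the desired number of splits) between the configured minimum split size and the block size. A plain-Java sketch of that formula, assuming the classic max/min clamping:

```java
// Sketch of the split-size formula: take the goal size, but never exceed
// the block size and never fall below the configured minimum split size.
class SplitSize {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```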
getBlockIndex
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
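getBlockIndex locates the block that contains a given file offset. The lookup can be sketched in plain Java, using simple {offset, length} pairs in place of BlockLocation objects; returning -1 for an out-of-range offset is an illustrative choice, not necessarily the real method's behavior.

```java
// Sketch of the offset-to-block lookup. Each block is an {offset, length}
// pair. Returns the index of the block containing the given file offset,
// or -1 (an assumption for this sketch) if the offset is past the last block.
class BlockIndex {
    static int getBlockIndex(long[][] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            long start = blocks[i][0];
            long end = start + blocks[i][1];
            if (offset >= start && offset < end) {
                return i;
            }
        }
        return -1; // offset outside the file
    }
}
```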
-
setInputPaths
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)
Sets the given comma separated paths as the list of inputs for the map-reduce job.
- Parameters:
conf - Configuration of the job
commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
-
addInputPaths
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
Add the given comma separated paths to the list of inputs for the map-reduce job.
- Parameters:
conf - The configuration of the job
commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
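Conceptually, the comma-separated overloads turn a string like "in1,in2,in3" into individual input paths. A minimal sketch of that parsing (the class name is hypothetical, and escaping of commas inside path names is deliberately omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of what the comma-separated set/add overloads conceptually
// do: break "dir1,dir2,file1" into individual input paths. Path names
// containing escaped commas are not handled in this sketch.
class InputPathParsing {
    static List<String> parse(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        for (String p : commaSeparatedPaths.split(",")) {
            if (!p.isEmpty()) {
                paths.add(p);
            }
        }
        return paths;
    }
}
```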
-
setInputPaths
public static void setInputPaths(JobConf conf, Path... inputPaths)
Set the array of Paths as the list of inputs for the map-reduce job.
- Parameters:
conf - Configuration of the job.
inputPaths - the Paths of the input directories/files for the map-reduce job.
-
addInputPath
public static void addInputPath(JobConf conf, Path path)
Add a Path to the list of inputs for the map-reduce job.
- Parameters:
conf - The configuration of the job
path - Path to be added to the list of inputs for the map-reduce job.
-
getInputPaths
public static Path[] getInputPaths(JobConf conf)
Get the list of input Paths for the map-reduce job.
- Parameters:
conf - The configuration of the job
- Returns:
the list of input Paths for the map-reduce job.
-
getSplitHosts
protected String[] getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap) throws IOException
This function identifies and returns the hosts that contribute most for a given split. For calculating the contribution, rack locality is treated on par with host locality, so hosts from racks that contribute the most are preferred over hosts on racks that contribute less.
- Parameters:
blkLocations - The list of block locations
offset -
splitSize -
- Returns:
an array of hosts that contribute most to this split
- Throws:
IOException
-