@InterfaceAudience.Public @InterfaceStability.Stable public abstract class FileInputFormat<K,V> extends Object implements InputFormat<K,V>
A base class for file-based InputFormat.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobConf, int).

Implementations of FileInputFormat can also override the isSplitable(FileSystem, Path) method to prevent input files from being split up in certain situations. Implementations that may deal with non-splittable files must override this method, since the default implementation assumes splitting is always possible.
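The override rule above can be illustrated without Hadoop on the classpath. The following is a minimal sketch of the kind of decision an isSplitable(FileSystem, Path) override might make. The class name, the String-based method, and the list of extensions are all hypothetical and are not part of FileInputFormat.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the check inside an isSplitable override.
// It uses plain file names instead of Hadoop's FileSystem and Path types.
class SplitabilitySketch {
    // Assumed set of extensions that indicate stream-compressed files.
    private static final List<String> STREAM_COMPRESSED = Arrays.asList(".gz", ".deflate");

    // Returns false when the name suggests a stream-compressed file,
    // so that such a file would be handed to a single mapper as a whole.
    static boolean isSplitable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : STREAM_COMPRESSED) {
            if (lower.endsWith(ext)) {
                return false;
            }
        }
        // Same answer as the documented default: splitting is possible.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.txt"));
        System.out.println(isSplitable("logs/2022-01-01.GZ"));
    }
}
```

In a real subclass the same check would live in the protected isSplitable(FileSystem fs, Path filename) method, and returning false unconditionally is enough when every input file must be processed whole.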
Modifier and Type | Field and Description |
---|---|
static String | INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS |
static String | INPUT_DIR_RECURSIVE |
static org.slf4j.Logger | LOG |
static String | NUM_INPUT_FILES |
Constructor and Description |
---|
FileInputFormat() |
Modifier and Type | Method and Description |
---|---|
static void | addInputPath(JobConf conf, Path path): Add a Path to the list of inputs for the map-reduce job. |
protected void | addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter): Add files in the input path recursively into the results. |
static void | addInputPaths(JobConf conf, String commaSeparatedPaths): Add the given comma separated paths to the list of inputs for the map-reduce job. |
protected long | computeSplitSize(long goalSize, long minSize, long blockSize) |
protected int | getBlockIndex(BlockLocation[] blkLocations, long offset) |
static PathFilter | getInputPathFilter(JobConf conf): Get a PathFilter instance of the filter set for the input paths. |
static Path[] | getInputPaths(JobConf conf): Get the list of input Paths for the map-reduce job. |
abstract RecordReader<K,V> | getRecordReader(InputSplit split, JobConf job, Reporter reporter): Get the RecordReader for the given InputSplit. |
protected String[] | getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap): This function identifies and returns the hosts that contribute most for a given split. |
InputSplit[] | getSplits(JobConf job, int numSplits): Splits files returned by listStatus(JobConf) when they're too big. |
protected boolean | isSplitable(FileSystem fs, Path filename): Is the given filename splittable? Usually, true, but if the file is stream compressed, it will not be. |
protected FileStatus[] | listStatus(JobConf job): List input directories. |
protected FileSplit | makeSplit(Path file, long start, long length, String[] hosts): A factory that makes the split for this class. |
protected FileSplit | makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts): A factory that makes the split for this class. |
static void | setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter): Set a PathFilter to be applied to the input paths for the map-reduce job. |
static void | setInputPaths(JobConf conf, Path... inputPaths): Set the array of Paths as the list of inputs for the map-reduce job. |
static void | setInputPaths(JobConf conf, String commaSeparatedPaths): Sets the given comma separated paths as the list of inputs for the map-reduce job. |
protected void | setMinSplitSize(long minSplitSize) |
public static final org.slf4j.Logger LOG
public static final String NUM_INPUT_FILES
public static final String INPUT_DIR_RECURSIVE
public static final String INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS
protected void setMinSplitSize(long minSplitSize)
protected boolean isSplitable(FileSystem fs, Path filename)
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be.
The default implementation in FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method.
FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.
Parameters:
fs - the file system that the file is on
filename - the file name to check
public abstract RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Description copied from interface: InputFormat
Get the RecordReader for the given InputSplit.
It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.
Specified by:
getRecordReader in interface InputFormat<K,V>
Parameters:
split - the InputSplit
job - the job that this split belongs to
Returns:
a RecordReader
Throws:
IOException
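public static void setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)
Set a PathFilter to be applied to the input paths for the map-reduce job.
Parameters:
filter - the PathFilter class to use for filtering the input paths.
public static PathFilter getInputPathFilter(JobConf conf)
Get a PathFilter instance of the filter set for the input paths.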
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
Add files in the input path recursively into the results.
Parameters:
result - The List to store all files.
fs - The FileSystem.
path - The input path.
inputFilter - The input filter that can be used to filter files/dirs.
Throws:
IOException
protected FileStatus[] listStatus(JobConf job) throws IOException
List input directories.
Parameters:
job - the job to list input paths for
Throws:
IOException - if zero items.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class.
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Splits files returned by listStatus(JobConf) when they're too big.
Specified by:
getSplits in interface InputFormat<K,V>
Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
Returns:
an array of InputSplits for the job.
Throws:
IOException
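This page does not spell out how a file that is "too big" gets divided. The sketch below shows one way a splittable file could be carved into (start, length) ranges once a split size is known. It is an illustration under stated assumptions rather than the class's actual code: the class name, the carve method, and the 10% slack factor are assumptions, and host selection is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: carves one file into byte ranges of roughly splitSize.
class FileRangeSketch {
    // Assumed slack: the final range may be up to 10% larger than splitSize,
    // which avoids emitting a tiny trailing range.
    static final double SPLIT_SLOP = 1.1;

    // Returns {start, length} pairs that together cover [0, fileLength).
    // A zero-length file yields no ranges in this sketch.
    static List<long[]> carve(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long remaining = fileLength;
        // Emit full-size ranges while clearly more than one split remains.
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            ranges.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left becomes the last range.
        if (remaining != 0) {
            ranges.add(new long[] {fileLength - remaining, remaining});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : carve(250, 100)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```

With a split size of 100, a 250-byte file is carved into three ranges, while a 105-byte file stays in one piece because it falls within the slack.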
protected long computeSplitSize(long goalSize, long minSize, long blockSize)
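The signature alone does not say how the three sizes are combined. A minimal sketch, assuming the goal size is capped at the block size and the minimum size is then enforced as a floor; this rule is an assumption about the behavior, and the class name is hypothetical.

```java
// Illustrative only: one plausible way to combine the three sizes.
class SplitSizeSketch {
    // Cap the goal at one block, then never go below the minimum.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Goal smaller than a block: the goal wins.
        System.out.println(computeSplitSize(64 * mb, 1, 128 * mb));
        // Goal larger than a block: the block size caps it.
        System.out.println(computeSplitSize(512 * mb, 1, 128 * mb));
        // Minimum larger than both: the minimum wins.
        System.out.println(computeSplitSize(64 * mb, 256 * mb, 128 * mb));
    }
}
```

Under this rule, raising the minimum (for example through setMinSplitSize) is the only way to get splits larger than one block.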
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
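No description accompanies getBlockIndex here. Judging by its name and parameters, it presumably locates the block that contains a given byte offset. The sketch below illustrates that lookup with plain arrays in place of BlockLocation[]; the class name, the parallel-array representation, and the exception on an out-of-range offset are assumptions.

```java
// Illustrative only: finds which block contains a byte offset.
class BlockIndexSketch {
    // blockOffsets[i] is the starting byte of block i,
    // blockLengths[i] is its length in bytes.
    static int blockIndex(long[] blockOffsets, long[] blockLengths, long offset) {
        for (int i = 0; i < blockOffsets.length; i++) {
            long start = blockOffsets[i];
            long end = start + blockLengths[i]; // exclusive
            if (start <= offset && offset < end) {
                return i;
            }
        }
        // Assumed behavior for an offset that no block covers.
        throw new IllegalArgumentException("Offset " + offset + " is outside of the file");
    }

    public static void main(String[] args) {
        long[] offsets = {0, 128, 256};
        long[] lengths = {128, 128, 64};
        System.out.println(blockIndex(offsets, lengths, 130));
    }
}
```

A linear scan is adequate for this sketch because the number of blocks per file is usually small.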
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)
Sets the given comma separated paths as the list of inputs for the map-reduce job.
Parameters:
conf - Configuration of the job
commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
Add the given comma separated paths to the list of inputs for the map-reduce job.
Parameters:
conf - The configuration of the job
commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
public static void setInputPaths(JobConf conf, Path... inputPaths)
Set the array of Paths as the list of inputs for the map-reduce job.
Parameters:
conf - Configuration of the job.
inputPaths - the Paths of the input directories/files for the map-reduce job.
public static void addInputPath(JobConf conf, Path path)
Add a Path to the list of inputs for the map-reduce job.
Parameters:
conf - The configuration of the job
path - Path to be added to the list of inputs for the map-reduce job.
public static Path[] getInputPaths(JobConf conf)
Get the list of input Paths for the map-reduce job.
Parameters:
conf - The configuration of the job
Returns:
the list of input Paths for the map-reduce job.
protected String[] getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap) throws IOException
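This function identifies and returns the hosts that contribute most for a given split.
Parameters:
blkLocations - The list of block locations
offset -
splitSize -
Throws:
IOException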
Copyright © 2022 Apache Software Foundation. All rights reserved.