FileInputFormat (Apache Hadoop Main 2.4.1 API)

org.apache.hadoop.mapred
Class FileInputFormat<K,V>

java.lang.Object
  org.apache.hadoop.mapred.FileInputFormat<K,V>

All Implemented Interfaces: InputFormat<K,V>

Direct Known Subclasses: FixedLengthInputFormat, KeyValueTextInputFormat, MultiFileInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat

@InterfaceAudience.Public @InterfaceStability.Stable public abstract class FileInputFormat<K,V>
extends Object
implements InputFormat<K,V>

A base class for file-based InputFormat.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobConf, int). Subclasses can also override the isSplitable(FileSystem, Path) method to return false, which ensures that input files are not split and that each file is processed as a whole by a single Mapper.

Field Summary
`static String`	`INPUT_DIR_RECURSIVE`
`static org.apache.commons.logging.Log`	`LOG`
`static String`	`NUM_INPUT_FILES`

Constructor Summary
`FileInputFormat()`

Method Summary
`static void`	`addInputPath(JobConf conf, Path path)` Add a `Path` to the list of inputs for the map-reduce job.
`protected void`	`addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter)` Add files in the input path recursively into the results.
`static void`	`addInputPaths(JobConf conf, String commaSeparatedPaths)` Add the given comma separated paths to the list of inputs for the map-reduce job.
`protected long`	`computeSplitSize(long goalSize, long minSize, long blockSize)`
`protected int`	`getBlockIndex(BlockLocation[] blkLocations, long offset)`
`static PathFilter`	`getInputPathFilter(JobConf conf)` Get a PathFilter instance of the filter set for the input paths.
`static Path[]`	`getInputPaths(JobConf conf)` Get the list of input `Path`s for the map-reduce job.
`abstract RecordReader<K,V>`	`getRecordReader(InputSplit split, JobConf job, Reporter reporter)` Get the `RecordReader` for the given `InputSplit`.
`protected String[]`	`getSplitHosts(BlockLocation[] blkLocations, long offset, long splitSize, org.apache.hadoop.net.NetworkTopology clusterMap)` This function identifies and returns the hosts that contribute most for a given split.
`InputSplit[]`	`getSplits(JobConf job, int numSplits)` Splits files returned by `listStatus(JobConf)` when they're too big.
`protected boolean`	`isSplitable(FileSystem fs, Path filename)` Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be.
`protected FileStatus[]`	`listStatus(JobConf job)` List input directories.
`protected FileSplit`	`makeSplit(Path file, long start, long length, String[] hosts)` A factory that makes the split for this class.
`static void`	`setInputPathFilter(JobConf conf, Class<? extends PathFilter> filter)` Set a PathFilter to be applied to the input paths for the map-reduce job.
`static void`	`setInputPaths(JobConf conf, Path... inputPaths)` Set the array of `Path`s as the list of inputs for the map-reduce job.
`static void`	`setInputPaths(JobConf conf, String commaSeparatedPaths)` Sets the given comma separated paths as the list of inputs for the map-reduce job.
`protected void`	`setMinSplitSize(long minSplitSize)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG

NUM_INPUT_FILES

public static final String NUM_INPUT_FILES

See Also: Constant Field Values

INPUT_DIR_RECURSIVE

public static final String INPUT_DIR_RECURSIVE

See Also: Constant Field Values

Constructor Detail

FileInputFormat

public FileInputFormat()

Method Detail

setMinSplitSize

protected void setMinSplitSize(long minSplitSize)

isSplitable

protected boolean isSplitable(FileSystem fs,
                              Path filename)

Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. FileInputFormat implementations can override this and return false to ensure that individual input files are never split-up so that Mappers process entire files.

Parameters: fs - the file system that the file is on; filename - the file name to check
Returns: true if this file is splitable, false otherwise
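
The decision described above can be illustrated with a minimal plain-Java sketch. It is not the Hadoop implementation: the class name SplitableSketch and the suffix check are hypothetical simplifications, and a real subclass would override isSplitable(FileSystem, Path) and would normally resolve the codec through Hadoop's compression codec classes rather than inspect the file name. The sketch relies only on the general fact that a gzip stream cannot be read from an arbitrary offset, while bzip2 is block-oriented.

```java
import java.util.Locale;

/** Hypothetical stand-in for an isSplitable decision; does not use Hadoop classes. */
final class SplitableSketch {
    private SplitableSketch() {}

    /**
     * Returns false for gzip-compressed names, because a gzip stream must be
     * decompressed from its beginning; returns true for everything else,
     * including bzip2, which is block-oriented.
     */
    static boolean isSplitable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return !lower.endsWith(".gz");
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("events.txt"));    // plain text
        System.out.println(isSplitable("events.txt.gz")); // stream compressed
    }
}
```

A subclass that must always hand whole files to Mappers would simply return false unconditionally.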

getRecordReader

public abstract RecordReader<K,V> getRecordReader(InputSplit split,
                                                  JobConf job,
                                                  Reporter reporter)
                                           throws IOException

Description copied from interface: InputFormat

Get the RecordReader for the given InputSplit.

It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.

Specified by: getRecordReader in interface InputFormat<K,V>

Parameters: split - the InputSplit; job - the job that this split belongs to
Returns: a RecordReader
Throws: IOException

setInputPathFilter

public static void setInputPathFilter(JobConf conf,
                                      Class<? extends PathFilter> filter)

Set a PathFilter to be applied to the input paths for the map-reduce job.

Parameters: filter - the PathFilter class used for filtering the input paths.

getInputPathFilter

public static PathFilter getInputPathFilter(JobConf conf)

Get a PathFilter instance of the filter set for the input paths.

Returns: the PathFilter instance set for the job, or null if none has been set.

addInputPathRecursively

protected void addInputPathRecursively(List<FileStatus> result,
                                       FileSystem fs,
                                       Path path,
                                       PathFilter inputFilter)
                                throws IOException

Add files in the input path recursively into the results.

Parameters: result - The List to store all files.; fs - The FileSystem.; path - The input path.; inputFilter - The input filter that can be used to filter files/dirs.
Throws: IOException
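
The recursive traversal can be sketched with the standard java.nio.file API in place of Hadoop's FileSystem and PathFilter. This is an illustration under assumptions, not the Hadoop code: the class RecursiveListingSketch, the use of Predicate<Path> as the filter, and the demo() helper are all hypothetical. The filter is applied to directories as well as files, matching the "files/dirs" wording of the parameter description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Local-filesystem analogue of a recursive input listing; not Hadoop code. */
final class RecursiveListingSketch {
    private RecursiveListingSketch() {}

    /** Adds every accepted regular file under path to result, descending into accepted directories. */
    static void addInputPathRecursively(List<Path> result, Path path,
                                        Predicate<Path> inputFilter) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
            for (Path child : children) {
                if (!inputFilter.test(child)) {
                    continue; // rejected files and directories are skipped entirely
                }
                if (Files.isDirectory(child)) {
                    addInputPathRecursively(result, child, inputFilter);
                } else {
                    result.add(child);
                }
            }
        }
    }

    /** Builds a small temporary tree and returns the accepted files, relative to its root, sorted. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("inputs");
            Files.createDirectories(root.resolve("2014").resolve("01"));
            Files.createFile(root.resolve("2014").resolve("01").resolve("part-00000"));
            Files.createFile(root.resolve("2014").resolve("_SUCCESS"));
            Files.createFile(root.resolve("top.txt"));

            List<Path> found = new ArrayList<>();
            // Hypothetical filter: ignore names that start with an underscore.
            addInputPathRecursively(found, root,
                    p -> !p.getFileName().toString().startsWith("_"));

            List<String> names = new ArrayList<>();
            for (Path p : found) {
                names.add(root.relativize(p).toString().replace('\\', '/'));
            }
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```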

listStatus

protected FileStatus[] listStatus(JobConf job)
                           throws IOException

List input directories. Subclasses may override to, e.g., select only files matching a regular expression.

Parameters: job - the job to list input paths for
Returns: array of FileStatus objects
Throws: IOException - if no input paths are specified for the job.
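
The regular-expression selection mentioned above can be sketched in plain Java over a list of file names. The class ListStatusFilterSketch and its select method are hypothetical and stand in for an overridden listStatus(JobConf). Skipping names that begin with "_" or "." reflects an assumption about Hadoop's default hidden-file convention and should be checked against the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical name-based selection; a real override would filter FileStatus objects. */
final class ListStatusFilterSketch {
    private ListStatusFilterSketch() {}

    /** Keeps names that are not hidden and that match the whole pattern. */
    static List<String> select(List<String> fileNames, Pattern pattern) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            // Assumed convention: names starting with '_' or '.' are treated as hidden.
            boolean hidden = name.startsWith("_") || name.startsWith(".");
            if (!hidden && pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "part-00000", "part-00001", "_SUCCESS", ".part-00000.crc", "notes.txt");
        System.out.println(select(names, Pattern.compile("part-\\d+")));
    }
}
```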

makeSplit

protected FileSplit makeSplit(Path file,
                              long start,
                              long length,
                              String[] hosts)

A factory that makes the split for this class. It can be overridden by subclasses to make subtypes.

getSplits

public InputSplit[] getSplits(JobConf job,
                              int numSplits)
                       throws IOException

Splits files returned by listStatus(JobConf) when they're too big.

Specified by: getSplits in interface InputFormat<K,V>

Parameters: job - job configuration.; numSplits - the desired number of splits, a hint.
Returns: an array of InputSplits for the job.
Throws: IOException

computeSplitSize

protected long computeSplitSize(long goalSize,
                                long minSize,
                                long blockSize)
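
This method carries no description here. The following plain-Java sketch assumes it behaves as in the Hadoop 2.x source, where the goal size is capped at the block size and the result is then raised to the minimum size if necessary; that formula is an assumption to verify against the Hadoop version in use. The class name SplitSizeSketch is hypothetical.

```java
/** Plain-Java sketch of an assumed split-size rule; not the Hadoop class itself. */
final class SplitSizeSketch {
    private SplitSizeSketch() {}

    /**
     * Caps goalSize at blockSize, then enforces minSize as a lower bound.
     * goalSize is typically the total input size divided by the requested number of splits.
     */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1024 MB of input and 4 requested splits give a 256 MB goal, capped at the 128 MB block.
        System.out.println(computeSplitSize(1024 * mb / 4, 1, 128 * mb) / mb); // prints 128
        // A minimum larger than the block size wins over both other values.
        System.out.println(computeSplitSize(32 * mb, 256 * mb, 128 * mb) / mb); // prints 256
    }
}
```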

getBlockIndex

protected int getBlockIndex(BlockLocation[] blkLocations,
                            long offset)
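
This method also has no description. Judging by its signature, it maps a byte offset to the index of the block that contains it. The sketch below models that lookup in plain Java, under the assumption that blocks are described by parallel arrays of start offsets and lengths; the class BlockIndexSketch and the array-based signature are hypothetical replacements for BlockLocation[].

```java
/** Hypothetical offset-to-block lookup over parallel arrays; not Hadoop code. */
final class BlockIndexSketch {
    private BlockIndexSketch() {}

    /** Returns the index i such that blockOffsets[i] <= offset < blockOffsets[i] + blockLengths[i]. */
    static int getBlockIndex(long[] blockOffsets, long[] blockLengths, long offset) {
        for (int i = 0; i < blockOffsets.length; i++) {
            long start = blockOffsets[i];
            if (start <= offset && offset < start + blockLengths[i]) {
                return i;
            }
        }
        throw new IllegalArgumentException("Offset " + offset + " is outside of the file");
    }

    public static void main(String[] args) {
        long[] offsets = {0, 128, 256};
        long[] lengths = {128, 128, 64};
        System.out.println(getBlockIndex(offsets, lengths, 130)); // prints 1
    }
}
```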

setInputPaths

public static void setInputPaths(JobConf conf,
                                 String commaSeparatedPaths)

Sets the given comma separated paths as the list of inputs for the map-reduce job.

Parameters: conf - Configuration of the job; commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.

addInputPaths

public static void addInputPaths(JobConf conf,
                                 String commaSeparatedPaths)

Add the given comma separated paths to the list of inputs for the map-reduce job.

Parameters: conf - The configuration of the job; commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
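
The difference between setting and adding input paths can be modeled in plain Java: setting replaces the whole list, while adding appends to it. The sketch is an illustration only. The class InputPathsSketch, the Map used in place of JobConf, and the key "example.input.dir" are hypothetical, and the simple comma split ignores any special handling of glob patterns.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of set-versus-add semantics; a Map stands in for JobConf. */
final class InputPathsSketch {
    /** Hypothetical configuration key, not the one Hadoop uses. */
    static final String KEY = "example.input.dir";

    private InputPathsSketch() {}

    /** Replaces any previously configured input paths. */
    static void setInputPaths(Map<String, String> conf, String commaSeparatedPaths) {
        conf.put(KEY, commaSeparatedPaths);
    }

    /** Appends to the configured input paths, keeping the existing ones. */
    static void addInputPaths(Map<String, String> conf, String commaSeparatedPaths) {
        String existing = conf.get(KEY);
        boolean empty = existing == null || existing.isEmpty();
        conf.put(KEY, empty ? commaSeparatedPaths : existing + "," + commaSeparatedPaths);
    }

    /** Returns the configured paths, or an empty array if none are set. */
    static String[] getInputPaths(Map<String, String> conf) {
        String value = conf.getOrDefault(KEY, "");
        return value.isEmpty() ? new String[0] : value.split(",");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setInputPaths(conf, "/data/a,/data/b");
        addInputPaths(conf, "/data/c");
        System.out.println(String.join(" ", getInputPaths(conf))); // prints /data/a /data/b /data/c
    }
}
```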

setInputPaths

public static void setInputPaths(JobConf conf,
                                 Path... inputPaths)

Set the array of Paths as the list of inputs for the map-reduce job.

Parameters: conf - Configuration of the job.; inputPaths - the Paths of the input directories/files for the map-reduce job.

addInputPath

public static void addInputPath(JobConf conf,
                                Path path)

Add a Path to the list of inputs for the map-reduce job.

Parameters: conf - The configuration of the job; path - Path to be added to the list of inputs for the map-reduce job.

getInputPaths

public static Path[] getInputPaths(JobConf conf)

Get the list of input Paths for the map-reduce job.

Parameters: conf - The configuration of the job
Returns: the list of input Paths for the map-reduce job.

getSplitHosts

protected String[] getSplitHosts(BlockLocation[] blkLocations,
                                 long offset,
                                 long splitSize,
                                 org.apache.hadoop.net.NetworkTopology clusterMap)
                          throws IOException

This function identifies and returns the hosts that contribute most for a given split. For calculating the contribution, rack locality is treated on par with host locality, so hosts from racks that contribute the most are preferred over hosts on racks that contribute less.

Parameters: blkLocations - The list of block locations; offset - the start offset of the split; splitSize - the size of the split
Returns: array of hosts that contribute most to this split
Throws: IOException
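
The idea of ranking hosts by how much of a split they hold can be sketched in plain Java. This is a deliberately reduced model, not the Hadoop algorithm: it ignores rack topology entirely, so it does not reproduce the rack-level preference described above. The class SplitHostsSketch, the topHosts method and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Host-only ranking by contributed bytes; rack locality is intentionally not modeled. */
final class SplitHostsSketch {
    private SplitHostsSketch() {}

    /**
     * blockHosts[i] lists the hosts storing block i; bytesInSplit[i] is how many bytes
     * of block i fall inside the split. Returns up to maxHosts hosts, largest share first.
     */
    static String[] topHosts(String[][] blockHosts, long[] bytesInSplit, int maxHosts) {
        Map<String, Long> contribution = new HashMap<>();
        for (int i = 0; i < blockHosts.length; i++) {
            for (String host : blockHosts[i]) {
                contribution.merge(host, bytesInSplit[i], Long::sum);
            }
        }
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(contribution.entrySet());
        // Sort by contributed bytes, descending; break ties by host name for determinism.
        ranked.sort((a, b) -> {
            int byBytes = Long.compare(b.getValue(), a.getValue());
            return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
        });
        int n = Math.max(0, Math.min(maxHosts, ranked.size()));
        String[] hosts = new String[n];
        for (int i = 0; i < n; i++) {
            hosts[i] = ranked.get(i).getKey();
        }
        return hosts;
    }
}
```

For two blocks stored on {h1, h2} and {h2, h3} that contribute 100 and 50 bytes respectively, h2 holds 150 bytes of the split, h1 holds 100 and h3 holds 50, so asking for two hosts returns h2 followed by h1.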
