Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
- Direct Known Subclasses:
ChainMapper,FieldSelectionMapper,InverseMapper,MultithreadedMapper,RegexMapper,TokenCounterMapper,ValueAggregatorMapper,WrappedMapper
Maps are the individual tasks which transform input records into a intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each
InputSplit generated by the InputFormat for the job.
Mapper implementations can access the Configuration for
the job via the JobContext.getConfiguration().
The framework first calls
setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by
map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context)
for each key/value pair in the InputSplit. Finally
cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called.
All intermediate values associated with a given output key are
subsequently grouped by the framework, and passed to a Reducer to
determine the final output. Users can control the sorting and grouping by
specifying two key RawComparator classes.
The Mapper outputs are partitioned per
Reducer. Users can control which keys (and hence records) go to
which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via
Job.setCombinerClass(Class), to perform local aggregation of the
intermediate outputs, which helps to cut down the amount of data transferred
from the Mapper to the Reducer.
Applications can specify if and how the intermediate
outputs are to be compressed and which CompressionCodecs are to be
used via the Configuration.
If the job has zero
reduces then the output of the Mapper is directly written
to the OutputFormat without sorting by keys.
Example:
public class TokenCounterMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Applications may override the
run(org.apache.hadoop.mapreduce.Mapper.Context) method to exert
greater control on map processing e.g. multi-threaded Mappers
etc.
- See Also:
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionclassorg.apache.hadoop.mapreduce.Mapper.ContextTheContextpassed on to theMapperimplementations. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprotected voidCalled once at the end of the task.protected voidmap(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.org.apache.hadoop.mapreduce.Mapper.Context context) Called once for each key/value pair in the input split.voidExpert users can override this method for more complete control over the execution of the Mapper.protected voidCalled once at the beginning of the task.
-
Constructor Details
-
Mapper
public Mapper()
-
-
Method Details
-
setup
protected void setup(Mapper<KEYIN, VALUEIN, throws IOException, InterruptedExceptionKEYOUT, VALUEOUT>.org.apache.hadoop.mapreduce.Mapper.Context context) Called once at the beginning of the task.- Throws:
IOExceptionInterruptedException
-
map
protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, throws IOException, InterruptedExceptionKEYOUT, VALUEOUT>.org.apache.hadoop.mapreduce.Mapper.Context context) Called once for each key/value pair in the input split. Most applications should override this, but the default is the identity function.- Throws:
IOExceptionInterruptedException
-
cleanup
protected void cleanup(Mapper<KEYIN, VALUEIN, throws IOException, InterruptedExceptionKEYOUT, VALUEOUT>.org.apache.hadoop.mapreduce.Mapper.Context context) Called once at the end of the task.- Throws:
IOExceptionInterruptedException
-
run
public void run(Mapper<KEYIN, VALUEIN, throws IOException, InterruptedExceptionKEYOUT, VALUEOUT>.org.apache.hadoop.mapreduce.Mapper.Context context) Expert users can override this method for more complete control over the execution of the Mapper.- Parameters:
context-- Throws:
IOExceptionInterruptedException
-