Package org.apache.hadoop.mapred.join

Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map.


Interface Summary
ComposableInputFormat<K extends WritableComparable,V extends Writable> Refinement of InputFormat requiring implementors to provide ComposableRecordReader instead of RecordReader.
ComposableRecordReader<K extends WritableComparable,V extends Writable> Additional operations required of a RecordReader to participate in a join.
ResetableIterator<T extends Writable> This defines an interface to a stateful Iterator that can replay elements added to it directly.
 

Class Summary
ArrayListBackedIterator<X extends Writable> This class provides an implementation of ResetableIterator.
CompositeInputFormat<K extends WritableComparable> An InputFormat capable of performing joins over a set of data sources sorted and partitioned the same way.
CompositeInputSplit This InputSplit contains a set of child InputSplits.
CompositeRecordReader<K extends WritableComparable,V extends Writable,X extends Writable> A RecordReader that can effect joins of RecordReaders sharing a common key type and partitioning.
InnerJoinRecordReader<K extends WritableComparable> Full inner join.
JoinRecordReader<K extends WritableComparable> Base class for Composite joins returning Tuples of arbitrary Writables.
MultiFilterRecordReader<K extends WritableComparable,V extends Writable> Base class for Composite join returning values derived from multiple sources, but generally not tuples.
OuterJoinRecordReader<K extends WritableComparable> Full outer join.
OverrideRecordReader<K extends WritableComparable,V extends Writable> Prefer the "rightmost" data source for this key.
Parser Very simple shift-reduce parser for join expressions.
Parser.Node  
Parser.NodeToken  
Parser.NumToken  
Parser.StrToken  
Parser.Token Tagged-union type for tokens from the join expression.
ResetableIterator.EMPTY<U extends Writable>  
StreamBackedIterator<X extends Writable> This class provides an implementation of ResetableIterator.
TupleWritable Writable type storing multiple Writables.
WrappedRecordReader<K extends WritableComparable,U extends Writable> Proxy class for a RecordReader participating in the join framework.
 

Enum Summary
Parser.TType  
 

Package org.apache.hadoop.mapred.join Description

Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. This could save costs in re-partitioning, sorting, shuffling, and writing out data required in the general case.
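The reason sorted, identically partitioned inputs can be joined before the map is that one forward pass over each partition is enough to line up equal keys. The following is a minimal sketch of that idea and not the package's implementation: the class MergeJoinSketch and the method innerJoin are hypothetical names, keys are plain Integers rather than WritableComparables, and each key is assumed to appear at most once per source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of org.apache.hadoop.mapred.join.
final class MergeJoinSketch {

    // Inner-joins two lists that are already sorted by key, assuming each
    // key appears at most once per list. Both lists are read front to back
    // exactly once, so no re-sorting or shuffling is needed.
    static List<String> innerJoin(List<Map.Entry<Integer, String>> left,
                                  List<Map.Entry<Integer, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).getKey().compareTo(right.get(j).getKey());
            if (cmp < 0) {
                i++;                      // key present only on the left
            } else if (cmp > 0) {
                j++;                      // key present only on the right
            } else {
                out.add(left.get(i).getKey() + ":["
                        + left.get(i).getValue() + ","
                        + right.get(j).getValue() + "]");
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> a =
            List.of(Map.entry(1, "a1"), Map.entry(2, "a2"), Map.entry(4, "a4"));
        List<Map.Entry<Integer, String>> b =
            List.of(Map.entry(2, "b2"), Map.entry(3, "b3"), Map.entry(4, "b4"));
        System.out.println(innerJoin(a, b));
    }
}
```

If the two inputs were not sorted by the same ordering, the single pass would skip matching keys, which is why the package requires sorted datasets with equal partitions.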

Interface

This package offers the following configuration interface to users of these classes.

property                    required  value
mapred.join.expr            yes       Join expression to effect over input data
mapred.join.keycomparator   no        WritableComparator class to use for comparing keys
mapred.join.define.<ident>  no        Class mapped to identifier in join expression

The join expression understands the following grammar:

func ::= <ident>([<func>,]*<func>)
func ::= tbl(<class>,"<path>");
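To make the two productions concrete, here is a small recursive-descent recognizer for them. It is a sketch under assumptions and not the package's Parser: the class JoinExprCheck and its methods are hypothetical, whitespace between tokens is ignored, and identifiers and class names are read loosely as runs of Java identifier characters and dots. The package's own Parser remains the authority on what is actually accepted.

```java
// Hypothetical recognizer for the grammar above; illustration only.
final class JoinExprCheck {
    private final String s;
    private int pos = 0;

    private JoinExprCheck(String s) {
        this.s = s;
    }

    // Returns true if the whole input is exactly one well-formed func.
    static boolean isValid(String expr) {
        JoinExprCheck c = new JoinExprCheck(expr);
        return c.func() && c.skipWs() == expr.length();
    }

    // Advances past whitespace and returns the new position.
    private int skipWs() {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Consumes ch if it is the next non-whitespace character.
    private boolean eat(char ch) {
        skipWs();
        if (pos < s.length() && s.charAt(pos) == ch) {
            pos++;
            return true;
        }
        return false;
    }

    // Reads a run of identifier characters and dots; used for both
    // <ident> and <class>, which is looser than a real parser would be.
    private String name() {
        skipWs();
        int start = pos;
        while (pos < s.length()
                && (Character.isJavaIdentifierPart(s.charAt(pos))
                    || s.charAt(pos) == '.')) {
            pos++;
        }
        return s.substring(start, pos);
    }

    private boolean func() {
        String id = name();
        if (id.isEmpty() || !eat('(')) {
            return false;
        }
        if (id.equals("tbl")) {
            // func ::= tbl(<class>,"<path>")
            if (name().isEmpty() || !eat(',') || !eat('"')) {
                return false;
            }
            int close = s.indexOf('"', pos);
            if (close < 0) {
                return false;
            }
            pos = close + 1;
            return eat(')');
        }
        // func ::= <ident>([<func>,]*<func>), i.e. one or more children
        do {
            if (!func()) {
                return false;
            }
        } while (eat(','));
        return eat(')');
    }

    public static void main(String[] args) {
        System.out.println(isValid("inner(tbl(a.B.class,\"p\"), tbl(a.B.class,\"q\"))"));
        System.out.println(isValid("inner()"));
    }
}
```

Because the second production is the only leaf, every valid expression bottoms out in tbl nodes, and an operator with no children is rejected.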

Operations included in this package are partitioned into one of two types: join operations emitting tuples and "multi-filter" operations emitting a single value from (but not necessarily included in) a set of input values. For a given key, each operation will consider the cross product of all values for all sources at that node.

Identifiers supported by default:

identifier  type         description
inner       Join         Full inner join
outer       Join         Full outer join
override    MultiFilter  For a given key, prefer values from the rightmost source
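The difference between the three default operations is easiest to see for a single key. The sketch below models each source as the list of values it holds for that key (an empty list meaning the key is absent) and shows what each operation would emit. It is an assumption-laden illustration, not the package's code: JoinSemanticsSketch and its methods are hypothetical, values are Strings rather than Writables, and a missing source in an outer join is represented by a null slot purely for display.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of org.apache.hadoop.mapred.join.
final class JoinSemanticsSketch {

    // Cross product of the per-source value lists for one key.
    // inner (outer == false): any source lacking the key suppresses all output.
    // outer (outer == true):  a source lacking the key contributes a null slot.
    static List<List<String>> join(List<List<String>> sources, boolean outer) {
        List<List<String>> tuples = new ArrayList<>();
        tuples.add(new ArrayList<>());
        boolean anyPresent = false;
        for (List<String> values : sources) {
            List<String> slot = values;
            if (values.isEmpty()) {
                if (!outer) {
                    return new ArrayList<>();
                }
                slot = Arrays.asList((String) null);
            } else {
                anyPresent = true;
            }
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : tuples) {
                for (String v : slot) {
                    List<String> t = new ArrayList<>(prefix);
                    t.add(v);
                    next.add(t);
                }
            }
            tuples = next;
        }
        // A key that no source contains produces nothing at all.
        return anyPresent ? tuples : new ArrayList<>();
    }

    // Multi-filter: emit the values of the rightmost source holding the key.
    static List<String> override(List<List<String>> sources) {
        for (int i = sources.size() - 1; i >= 0; i--) {
            if (!sources.get(i).isEmpty()) {
                return sources.get(i);
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        List<List<String>> sources =
            List.of(List.of("a1", "a2"), List.<String>of(), List.of("c1"));
        System.out.println("inner:    " + join(sources, false));
        System.out.println("outer:    " + join(sources, true));
        System.out.println("override: " + override(sources));
    }
}
```

Note that the join operations return tuples with one position per source, while override returns bare values, which matches the Join and MultiFilter split described above.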

A user of this class must set the InputFormat for the job to CompositeInputFormat and define a join expression accepted by the preceding grammar. For example, both of the following are acceptable:

inner(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/bar"),
      tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/baz"))

outer(override(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/bar"),
               tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/baz")),
      tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/rab"))

CompositeInputFormat includes a handful of convenience methods to aid construction of these verbose statements.
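To show how such expressions can be assembled programmatically rather than typed by hand, here is a standalone sketch that builds the same textual shape. The class JoinExprBuilder and its methods tbl and op are hypothetical and are not the CompositeInputFormat convenience methods; they only concatenate strings, and the class name is passed through exactly as the caller writes it.

```java
import java.util.StringJoiner;

// Hypothetical string builder; illustration only.
final class JoinExprBuilder {

    // Leaf node: tbl(<class>,"<path>")
    static String tbl(String className, String path) {
        return "tbl(" + className + ",\"" + path + "\")";
    }

    // Interior node: <ident>(<func>,...,<func>) with at least one child.
    static String op(String ident, String... funcs) {
        if (funcs.length == 0) {
            throw new IllegalArgumentException("at least one child required");
        }
        StringJoiner joiner = new StringJoiner(",", ident + "(", ")");
        for (String f : funcs) {
            joiner.add(f);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String fmt = "org.apache.hadoop.mapred.SequenceFileInputFormat.class";
        String expr = op("outer",
            op("override",
                tbl(fmt, "hdfs://host:8020/foo/bar"),
                tbl(fmt, "hdfs://host:8020/foo/baz")),
            tbl(fmt, "hdfs://host:8020/foo/rab"));
        // The resulting string would be the value of mapred.join.expr.
        System.out.println(expr);
    }
}
```

Building the expression from nested calls keeps parentheses and quoting balanced, which is the main source of mistakes when these statements are written by hand.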

As in the second example, joins may be nested. Users may provide a comparator class in the mapred.join.keycomparator property to specify the ordering of their keys, or accept the default comparator as returned by WritableComparator.get(keyclass).

Users can specify their own join operations, typically by overriding JoinRecordReader or MultiFilterRecordReader and mapping that class to an identifier in the join expression using the mapred.join.define.ident property, where ident is the identifier appearing in the join expression. Users may elect to emit or modify values passing through their join operation. Consult the existing operations for guidance. Adding arguments is considerably more complex (and only partially supported), as one must also add a Node type to the parse tree. In most cases, one is probably better off extending RecordReader.




Copyright © 2009 The Apache Software Foundation