org.apache.hadoop.mapred.join (Hadoop 1.0.4 API)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV PACKAGE NEXT PACKAGE

FRAMES NO FRAMES

Package org.apache.hadoop.mapred.join

Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map.

See:
Description

Interface Summary
ComposableInputFormat<K extends WritableComparable,V extends Writable>	Refinement of InputFormat requiring implementors to provide ComposableRecordReader instead of RecordReader.
ComposableRecordReader<K extends WritableComparable,V extends Writable>	Additional operations required of a RecordReader to participate in a join.
ResetableIterator<T extends Writable>	This defines an interface to a stateful Iterator that can replay elements added to it directly.

Class Summary
ArrayListBackedIterator<X extends Writable>	This class provides an implementation of ResetableIterator.
CompositeInputFormat<K extends WritableComparable>	An InputFormat capable of performing joins over a set of data sources sorted and partitioned the same way.
CompositeInputSplit	This InputSplit contains a set of child InputSplits.
CompositeRecordReader<K extends WritableComparable,V extends Writable,X extends Writable>	A RecordReader that can effect joins of RecordReaders sharing a common key type and partitioning.
InnerJoinRecordReader<K extends WritableComparable>	Full inner join.
JoinRecordReader<K extends WritableComparable>	Base class for Composite joins returning Tuples of arbitrary Writables.
MultiFilterRecordReader<K extends WritableComparable,V extends Writable>	Base class for Composite join returning values derived from multiple sources, but generally not tuples.
OuterJoinRecordReader<K extends WritableComparable>	Full outer join.
OverrideRecordReader<K extends WritableComparable,V extends Writable>	Prefer the "rightmost" data source for this key.
Parser	Very simple shift-reduce parser for join expressions.
Parser.Node
Parser.NodeToken
Parser.NumToken
Parser.StrToken
Parser.Token	Tagged-union type for tokens from the join expression.
ResetableIterator.EMPTY<U extends Writable>
StreamBackedIterator<X extends Writable>	This class provides an implementation of ResetableIterator.
TupleWritable	Writable type storing multiple `Writable`s.
WrappedRecordReader<K extends WritableComparable,U extends Writable>	Proxy class for a RecordReader participating in the join framework.

Enum Summary
Parser.TType

Package org.apache.hadoop.mapred.join Description

Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. This could save costs in re-partitioning, sorting, shuffling, and writing out data required in the general case.

Interface

The attached code offers the following interface to users of these classes.

property	required	value
mapred.join.expr	yes	Join expression to effect over input data
mapred.join.keycomparator	no	`WritableComparator` class to use for comparing keys
mapred.join.define.<ident>	no	Class mapped to identifier in join expression

The join expression understands the following grammar:

func ::= <ident>([<func>,]*<func>)
func ::= tbl(<class>,"<path>");

Operations included in this patch are partitioned into one of two types: join operations emitting tuples and "multi-filter" operations emitting a single value from (but not necessarily included in) a set of input values. For a given key, each operation will consider the cross product of all values for all sources at that node.

Identifiers supported by default:

identifier	type	description
inner	Join	Full inner join
outer	Join	Full outer join
override	MultiFilter	For a given key, prefer values from the rightmost source

A user of this class must set the InputFormat for the job to CompositeInputFormat and define a join expression accepted by the preceding grammar. For example, both of the following are acceptable:

inner(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/bar"),
      tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/baz"))

outer(override(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/bar"),
               tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/baz")),
      tbl(org.apache.hadoop.mapred/SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/rab"))

CompositeInputFormat includes a handful of convenience methods to aid construction of these verbose statements.

As in the second example, joins may be nested. Users may provide a comparator class in the mapred.join.keycomparator property to specify the ordering of their keys, or accept the default comparator as returned by WritableComparator.get(keyclass).

Users can specify their own join operations, typically by overriding JoinRecordReader or MultiFilterRecordReader and mapping that class to an identifier in the join expression using the mapred.join.define.ident property, where ident is the identifier appearing in the join expression. Users may elect to emit- or modify- values passing through their join operation. Consulting the existing operations for guidance is recommended. Adding arguments is considerably more complex (and only partially supported), as one must also add a Node type to the parse tree. One is probably better off extending RecordReader in most cases.

JIRA