Examples
Package | Description
---|---
org.apache.hadoop.examples | Hadoop example code.
org.apache.hadoop.examples.dancing | A distributed implementation of Knuth's dancing links algorithm that can run under Hadoop.
org.apache.hadoop.examples.terasort | Three map/reduce applications for Hadoop that compete in the annual terabyte sort competition.

contrib: Streaming
Package | Description
---|---
org.apache.hadoop.streaming | Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables (e.g. Unix shell utilities) as the mapper and/or the reducer.
org.apache.hadoop.streaming.io |

contrib: DataJoin
Package | Description
---|---
org.apache.hadoop.contrib.utils.join |

contrib: Index
Package | Description
---|---
org.apache.hadoop.contrib.index.example |
org.apache.hadoop.contrib.index.lucene |
org.apache.hadoop.contrib.index.main |
org.apache.hadoop.contrib.index.mapred |

contrib: FailMon
Package | Description
---|---
org.apache.hadoop.contrib.failmon |
Hadoop is a distributed computing platform. It consists primarily of the Hadoop Distributed FileSystem (HDFS) and an implementation of the Map-Reduce programming paradigm. As a software framework, Hadoop lets one easily write and run applications that process vast amounts of data, and it is designed to be scalable, economical, efficient, and reliable.
If your platform does not have the required software (a working Java installation, ssh with a running sshd, and rsync), you will have to install it.
For example on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
On Windows, if you did not install the required software when you installed cygwin, start the cygwin installer and select the openssh package (in the Net category).
First, you need to get a copy of the Hadoop code.
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME.
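For example, to point Hadoop at a Sun JDK on Ubuntu you might add a line like this (the path is illustrative; use the root of your own Java installation):
export JAVA_HOME=/usr/lib/jvm/java-6-sun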
Try the following command:
bin/hadoop
This will display the documentation for the Hadoop command script.
By default, Hadoop is configured to run things in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*
This will display counts for each match of the regular expression.
Note that input is specified as a directory containing input files and that output is also specified as a directory where parts are written.
Hadoop can also run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. For this you must configure the NameNode (the HDFS master) and the JobTracker (the MapReduce master) host and port; the NameNode is specified with the configuration property fs.default.name and the JobTracker with mapred.job.tracker. (We also set the HDFS replication level to 1 in order to reduce warnings when running on a single node.)
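A minimal conf/hadoop-site.xml for pseudo-distributed operation might look like the following sketch; the localhost ports shown (9000 and 9001) are conventional but illustrative:

<configuration>
  <!-- NameNode (HDFS master) host and port -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- JobTracker (MapReduce master) host and port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- keep a single copy of each block on a single-node cluster -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>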
Now check that the command ssh localhost does not require a password. If it does, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
A new distributed filesystem must be formatted with the following command, run on the master node:
bin/hadoop namenode -format
The Hadoop daemons are started with the following command:
bin/start-all.sh
Daemon log output is written to the logs/ directory.
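To verify that the daemons came up, one option is the JDK's jps tool, which lists running Java processes; on a single-node setup you would typically expect to see processes named NameNode, DataNode, JobTracker, and TaskTracker:
jps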
Input files are copied into the distributed filesystem as follows:
bin/hadoop fs -put input input
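You can confirm the copy with a listing of the distributed filesystem:
bin/hadoop fs -ls input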
Things are run as before, but output must be copied locally to examine it:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
When you're done, stop the daemons with:
bin/stop-all.sh
Fully distributed operation is just like the pseudo-distributed operation described above, except that you specify the actual host and port of the NameNode and the JobTracker (in fs.default.name and mapred.job.tracker) rather than localhost, and an HDFS replication level appropriate for your cluster.
Finally, list all slave hostnames or IP addresses in your conf/slaves file, one per line. Then format your filesystem and start your cluster on your master node, as above.
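As a sketch, a conf/slaves file for a three-worker cluster might look like this (the hostnames are hypothetical):
worker1.example.com
worker2.example.com
worker3.example.com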