Class DistributedCache
DistributedCache is a facility provided by the Map-Reduce
framework to cache files (text, archives, jars etc.) needed by applications.
Applications specify the files to be cached via URLs (hdfs:// or http://)
in the JobConf. The
DistributedCache assumes that the files specified via URLs are
already present on the FileSystem at the path specified by the URL
and are accessible by every machine in the cluster.
The framework copies the necessary files onto the worker node before any tasks for the job are executed on that node. Its efficiency stems from two things: the files are copied only once per job, and archives can be cached and are un-archived on the workers.
DistributedCache can be used to distribute simple, read-only
data/text files and/or more complex types such as archives, jars etc.
Archives (zip, tar and tgz/tar.gz files) are un-archived at the worker nodes.
Jars may optionally be added to the classpath of the tasks, which provides a
rudimentary software distribution mechanism. Files have execution permissions.
In older versions of Hadoop Map/Reduce, users could optionally ask for symlinks
to be created in the working directory of the child task. In the current
version, symlinks are always created. If the URL does not have a fragment,
the name of the file or directory is used as the link name. If multiple files or
directories map to the same link name, the last one added is used; all
others are not even downloaded.
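The link-name rule described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the framework's localization code: the class CacheLinkNames and its methods are hypothetical names, and the sketch assumes hierarchical URIs such as the ones in the example on this page.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; illustrates the documented rule, not Hadoop's own code. */
final class CacheLinkNames {

    /** Link name: the URI fragment if present, otherwise the last path segment. */
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        // Strip a trailing slash so that a directory yields its own name.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Maps each link name to the URI that wins; a later entry replaces an earlier one. */
    static Map<String, URI> resolve(List<URI> cacheUris) {
        Map<String, URI> winners = new LinkedHashMap<>();
        for (URI uri : cacheUris) {
            winners.put(linkName(uri), uri);
        }
        return winners;
    }
}
```

Under this sketch, /myapp/lookup.dat#lookup.dat and /myapp/map.zip resolve to the link names lookup.dat and map.zip, and two entries that both end in data.txt leave only the second one in the result.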
DistributedCache tracks modification timestamps of the cache
files. Clearly the cache files should not be modified by the application
or externally while the job is executing.
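To illustrate why a modification during the job is detectable, the following sketch records a file's modification time and later checks that it is unchanged. It is an analogy on the local file system using java.nio.file only; the class CacheFileTimestamp is a hypothetical name, and the framework's own check against the FileSystem is not shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; works on the local file system, not on HDFS. */
final class CacheFileTimestamp {

    /** Records the modification time of a file in milliseconds since the epoch. */
    static long record(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    /** Returns true if the file still has the modification time recorded earlier. */
    static boolean unchangedSince(Path file, long recordedMillis) throws IOException {
        return Files.getLastModifiedTime(file).toMillis() == recordedMillis;
    }
}
```

A caller would record the timestamp once, before the job starts, and treat any later mismatch as a sign that the file was modified while it was in use.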
Here is an illustrative example of how to use the
DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
    implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // Get the cached archives/files. Archives are unpacked under a
    // symlink in the task's working directory (here: ./map.zip).
    File f = new File("./map.zip/some/file/in/zip.txt");
  }

  public void map(K key, V value,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    // ...
    output.collect(key, value);
  }
}
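As a sketch of what "use data from the cached files" might look like, the helper below loads a lookup file through its symlink in the task's working directory. The tab-separated key/value layout of the file is an assumption made for illustration, and LookupTable is a hypothetical class. Note that Path here is java.nio.file.Path, not Hadoop's Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; the file layout "key<TAB>value" is assumed. */
final class LookupTable {

    /** Reads every "key<TAB>value" line of the file into a map. */
    static Map<String, String> load(Path file) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }
}
```

With the cache set up as in step 2, a task could call LookupTable.load(java.nio.file.Paths.get("lookup.dat")) once during configuration and consult the map for each record.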
The DistributedCache is also commonly used through
GenericOptionsParser, whose -files, -libjars and -archives command-line
options add entries to the cache.
This class includes both methods intended for users
(specifically those mentioned in the example above, as well
as DistributedCache.addArchiveToClassPath(Path, Configuration))
and methods intended for use by the MapReduce framework
(e.g., JobClient).
Field Summary
Fields
Modifier and Type     Field                       Description
static final String   CACHE_FILES_SIZES           Deprecated.
static final String   CACHE_ARCHIVES_SIZES        Deprecated.
static final String   CACHE_ARCHIVES_TIMESTAMPS   Deprecated.
static final String   CACHE_FILES_TIMESTAMPS      Deprecated.
static final String   CACHE_ARCHIVES              Deprecated.
static final String   CACHE_FILES                 Deprecated.
static final String   CACHE_LOCALARCHIVES         Deprecated.
static final String   CACHE_LOCALFILES            Deprecated.
static final String   CACHE_SYMLINK               Deprecated.
Fields inherited from class org.apache.hadoop.mapreduce.filecache.DistributedCache
WILDCARD
Constructor Summary
Constructors
Constructor          Description
DistributedCache()   Deprecated.
Method Summary
Modifier and Type   Method                                                                  Description
static void         addLocalArchives(Configuration conf, String str)                        Deprecated.
static void         addLocalFiles(Configuration conf, String str)                           Deprecated.
static void         createAllSymlink(Configuration conf, File jobCacheDir, File workDir)    Deprecated. Internal to MapReduce framework.
static FileStatus   getFileStatus(Configuration conf, URI cache)                            Deprecated.
static long         getTimestamp(Configuration conf, URI cache)                             Deprecated.
static void         setArchiveTimestamps(Configuration conf, String timestamps)             Deprecated.
static void         setFileTimestamps(Configuration conf, String timestamps)                Deprecated.
static void         setLocalArchives(Configuration conf, String str)                        Deprecated.
static void         setLocalFiles(Configuration conf, String str)                           Deprecated.
Methods inherited from class org.apache.hadoop.mapreduce.filecache.DistributedCache
addArchiveToClassPath, addArchiveToClassPath, addCacheArchive, addCacheFile, addFileToClassPath, addFileToClassPath, addFileToClassPath, checkURIs, createSymlink, getArchiveClassPaths, getArchiveTimestamps, getArchiveVisibilities, getCacheArchives, getCacheFiles, getFileClassPaths, getFileTimestamps, getFileVisibilities, getLocalCacheArchives, getLocalCacheFiles, getSymlink, setCacheArchives, setCacheFiles
-
Field Details
-
CACHE_FILES_SIZES
Deprecated.
Warning: CACHE_FILES_SIZES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILES_SIZES.
-
CACHE_ARCHIVES_SIZES
Deprecated.
Warning: CACHE_ARCHIVES_SIZES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES_SIZES.
-
CACHE_ARCHIVES_TIMESTAMPS
Deprecated.
Warning: CACHE_ARCHIVES_TIMESTAMPS is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES_TIMESTAMPS.
-
CACHE_FILES_TIMESTAMPS
Deprecated.
Warning: CACHE_FILES_TIMESTAMPS is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILE_TIMESTAMPS.
-
CACHE_ARCHIVES
Deprecated.
Warning: CACHE_ARCHIVES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES.
-
CACHE_FILES
Deprecated.
Warning: CACHE_FILES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILES.
-
CACHE_LOCALARCHIVES
Deprecated.
Warning: CACHE_LOCALARCHIVES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_LOCALARCHIVES.
-
CACHE_LOCALFILES
Deprecated.
Warning: CACHE_LOCALFILES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_LOCALFILES.
-
CACHE_SYMLINK
Deprecated.
Warning: CACHE_SYMLINK is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_SYMLINK.
-
-
Constructor Details
-
DistributedCache
public DistributedCache()
Deprecated.
-
-
Method Details
-
addLocalArchives
Deprecated.
Add an archive that has been localized to the conf. Used by internal DistributedCache code.
Parameters:
conf - The conf to modify to contain the localized caches
str - a comma-separated list of local archives
-
addLocalFiles
Deprecated.
Add a file that has been localized to the conf. Used by internal DistributedCache code.
Parameters:
conf - The conf to modify to contain the localized caches
str - a comma-separated list of local files
-
createAllSymlink
@Deprecated
public static void createAllSymlink(Configuration conf, File jobCacheDir, File workDir) throws IOException
Deprecated. Internal to MapReduce framework. Use DistributedCacheManager instead.
This method creates symlinks for all files in a given directory in another directory. Currently symlinks cannot be disabled. This is a NO-OP.
Parameters:
conf - the configuration
jobCacheDir - the target directory for creating symlinks
workDir - the directory in which the symlinks are created
Throws:
IOException
-
getFileStatus
@Deprecated
public static FileStatus getFileStatus(Configuration conf, URI cache) throws IOException
Deprecated.
Returns FileStatus of a given cache file on hdfs. Internal to MapReduce.
Parameters:
conf - configuration
cache - cache file
Returns:
FileStatus of a given cache file on hdfs
Throws:
IOException
-
getTimestamp
Deprecated.
Returns mtime of a given cache file on hdfs. Internal to MapReduce.
Parameters:
conf - configuration
cache - cache file
Returns:
mtime of a given cache file on hdfs
Throws:
IOException
-
setArchiveTimestamps
Deprecated.
This is to check the timestamp of the archives to be localized. Used by internal MapReduce code.
Parameters:
conf - Configuration which stores the timestamps
timestamps - comma-separated list of timestamps of archives. The order should be the same as the order in which the archives are added.
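The comma-separated, order-sensitive format of the timestamps argument can be sketched as follows. TimestampList is a hypothetical helper written for illustration, not part of Hadoop; it only shows how a list of timestamps, kept in the same order as the archives, maps to and from the string form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper for the comma-separated timestamp format. */
final class TimestampList {

    /** Joins timestamps with commas, preserving the order of the input list. */
    static String join(List<Long> timestamps) {
        StringJoiner joiner = new StringJoiner(",");
        for (Long timestamp : timestamps) {
            joiner.add(Long.toString(timestamp));
        }
        return joiner.toString();
    }

    /** Parses a comma-separated string back into timestamps, in the same order. */
    static List<Long> parse(String csv) {
        List<Long> result = new ArrayList<>();
        if (csv == null || csv.trim().isEmpty()) {
            return result;
        }
        for (String part : csv.split(",")) {
            result.add(Long.parseLong(part.trim()));
        }
        return result;
    }
}
```

Because entries are matched by position, the i-th timestamp in the string belongs to the i-th archive that was added; reordering either list would pair timestamps with the wrong archives.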
-
setFileTimestamps
Deprecated.
This is to check the timestamp of the files to be localized. Used by internal MapReduce code.
Parameters:
conf - Configuration which stores the timestamps
timestamps - comma-separated list of timestamps of files. The order should be the same as the order in which the files are added.
-
setLocalArchives
Deprecated.
Set the conf to contain the location for localized archives. Used by internal DistributedCache code.
Parameters:
conf - The conf to modify to contain the localized caches
str - a comma-separated list of local archives
-
setLocalFiles
Deprecated.
Set the conf to contain the location for localized files. Used by internal DistributedCache code.
Parameters:
conf - The conf to modify to contain the localized caches
str - a comma-separated list of local files
-