org.apache.hadoop.filecache
Class DistributedCache

java.lang.Object
  extended by org.apache.hadoop.mapreduce.filecache.DistributedCache
      extended by org.apache.hadoop.filecache.DistributedCache

Deprecated.

@InterfaceAudience.Public
@InterfaceStability.Stable
@Deprecated
public class DistributedCache
extends org.apache.hadoop.mapreduce.filecache.DistributedCache

Distribute application-specific large, read-only files efficiently.

DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.

Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via URLs are already present on the FileSystem at the path specified by the URL and are accessible by every machine in the cluster.

The framework copies the necessary files to a slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job, and from the ability to cache archives, which are un-archived on the slaves.

DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may optionally be added to the classpath of the tasks, which provides a rudimentary software distribution mechanism. Files have execution permissions. In older versions of Hadoop Map/Reduce, users could optionally ask for symlinks to be created in the working directory of the child task. In the current version, symlinks are always created. If the URL does not have a fragment, the name of the file or directory is used as the link name. If multiple files or directories map to the same link name, the last one added is used; all others are not even downloaded.
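
The link-name rule described above (fragment if present, otherwise the file or directory name, with the last entry winning on a clash) can be illustrated with a small, standalone sketch. It uses only the JDK and is not the framework's own code; the class name LinkNameSketch and its methods are hypothetical, and it assumes hierarchical URIs such as hdfs:// paths.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is NOT Hadoop's implementation.
class LinkNameSketch {

    // Link name = URI fragment if present, otherwise the last path segment.
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        if (path == null) {
            // Opaque URIs (e.g. mailto:) have no path; out of scope here.
            throw new IllegalArgumentException("Not a hierarchical URI: " + uri);
        }
        // Drop a trailing slash so a directory URI still yields its name.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Later entries replace earlier ones that map to the same link name.
    static Map<String, URI> resolve(List<URI> cacheUris) {
        Map<String, URI> links = new LinkedHashMap<>();
        for (URI uri : cacheUris) {
            links.put(linkName(uri), uri);
        }
        return links;
    }

    public static void main(String[] args) {
        List<URI> uris = Arrays.asList(
            URI.create("/myapp/lookup.dat#lookup.dat"),
            URI.create("/myapp/map.zip"),
            URI.create("/other/data.txt#lookup.dat")); // clashes with the first entry
        System.out.println(resolve(uris));
    }
}
```

In this sketch the third URI replaces the first because both map to the link name lookup.dat, which mirrors the "last one added" rule.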

DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.

Here is an illustrative example of how to use the DistributedCache:

     // Setting up the cache for the application
     
     1. Copy the requisite files to the FileSystem:
     
     $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat  
     $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip  
     $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
     $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
     $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
     $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
     
     2. Set up the application's JobConf:
     
     JobConf job = new JobConf();
     DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), 
                                   job);
     DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
     DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
     
     3. Use the cached files in the Mapper or Reducer:
     
     public static class MapClass extends MapReduceBase  
     implements Mapper<K, V, K, V> {
     
       private Path[] localArchives;
       private Path[] localFiles;
       
       public void configure(JobConf job) {
         // Get the cached archives/files through the symlinks in the
         // task's working directory (map.zip has been un-archived)
         File f = new File("./map.zip/some/file/in/zip.txt");
       }
       
       public void map(K key, V value, 
                       OutputCollector<K, V> output, Reporter reporter) 
       throws IOException {
         // Use data from the cached archives/files here
         // ...
         // ...
         output.collect(key, value);
       }
     }
     
 

It is also very common to use the DistributedCache through GenericOptionsParser. This class includes methods that should be used by users (specifically those mentioned in the example above, as well as DistributedCache.addArchiveToClassPath(Path, Configuration)), as well as methods intended for use by the MapReduce framework (e.g., JobClient).

See Also:
JobConf, JobClient, Job

Field Summary
static String CACHE_ARCHIVES
          Deprecated. 
static String CACHE_ARCHIVES_SIZES
          Deprecated. 
static String CACHE_ARCHIVES_TIMESTAMPS
          Deprecated. 
static String CACHE_FILES
          Deprecated. 
static String CACHE_FILES_SIZES
          Deprecated. 
static String CACHE_FILES_TIMESTAMPS
          Deprecated. 
static String CACHE_LOCALARCHIVES
          Deprecated. 
static String CACHE_LOCALFILES
          Deprecated. 
static String CACHE_SYMLINK
          Deprecated. 
 
Constructor Summary
DistributedCache()
          Deprecated.  
 
Method Summary
static void addLocalArchives(Configuration conf, String str)
          Deprecated. 
static void addLocalFiles(Configuration conf, String str)
          Deprecated. 
static void createAllSymlink(Configuration conf, File jobCacheDir, File workDir)
          Deprecated. Internal to MapReduce framework. Use DistributedCacheManager instead.
static FileStatus getFileStatus(Configuration conf, URI cache)
          Deprecated. 
static long getTimestamp(Configuration conf, URI cache)
          Deprecated. 
static void setArchiveTimestamps(Configuration conf, String timestamps)
          Deprecated. 
static void setFileTimestamps(Configuration conf, String timestamps)
          Deprecated. 
static void setLocalArchives(Configuration conf, String str)
          Deprecated. 
static void setLocalFiles(Configuration conf, String str)
          Deprecated. 
 
Methods inherited from class org.apache.hadoop.mapreduce.filecache.DistributedCache
addArchiveToClassPath, addArchiveToClassPath, addCacheArchive, addCacheFile, addFileToClassPath, addFileToClassPath, checkURIs, createSymlink, getArchiveClassPaths, getArchiveTimestamps, getArchiveVisibilities, getCacheArchives, getCacheFiles, getFileClassPaths, getFileTimestamps, getFileVisibilities, getLocalCacheArchives, getLocalCacheFiles, getSymlink, setCacheArchives, setCacheFiles
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CACHE_FILES_SIZES

@Deprecated
public static final String CACHE_FILES_SIZES
Deprecated. 
Warning: CACHE_FILES_SIZES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILES_SIZES instead.

See Also:
Constant Field Values

CACHE_ARCHIVES_SIZES

@Deprecated
public static final String CACHE_ARCHIVES_SIZES
Deprecated. 
Warning: CACHE_ARCHIVES_SIZES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES_SIZES instead.

See Also:
Constant Field Values

CACHE_ARCHIVES_TIMESTAMPS

@Deprecated
public static final String CACHE_ARCHIVES_TIMESTAMPS
Deprecated. 
Warning: CACHE_ARCHIVES_TIMESTAMPS is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES_TIMESTAMPS instead.

See Also:
Constant Field Values

CACHE_FILES_TIMESTAMPS

@Deprecated
public static final String CACHE_FILES_TIMESTAMPS
Deprecated. 
Warning: CACHE_FILES_TIMESTAMPS is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILE_TIMESTAMPS instead.

See Also:
Constant Field Values

CACHE_ARCHIVES

@Deprecated
public static final String CACHE_ARCHIVES
Deprecated. 
Warning: CACHE_ARCHIVES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_ARCHIVES instead.

See Also:
Constant Field Values

CACHE_FILES

@Deprecated
public static final String CACHE_FILES
Deprecated. 
Warning: CACHE_FILES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_FILES instead.

See Also:
Constant Field Values

CACHE_LOCALARCHIVES

@Deprecated
public static final String CACHE_LOCALARCHIVES
Deprecated. 
Warning: CACHE_LOCALARCHIVES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_LOCALARCHIVES instead.

See Also:
Constant Field Values

CACHE_LOCALFILES

@Deprecated
public static final String CACHE_LOCALFILES
Deprecated. 
Warning: CACHE_LOCALFILES is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_LOCALFILES instead.

See Also:
Constant Field Values

CACHE_SYMLINK

@Deprecated
public static final String CACHE_SYMLINK
Deprecated. 
Warning: CACHE_SYMLINK is not a *public* constant. The variable is kept for M/R 1.x applications; M/R 2.x applications should use MRJobConfig.CACHE_SYMLINK instead.

See Also:
Constant Field Values
Constructor Detail

DistributedCache

public DistributedCache()
Deprecated. 
Method Detail

addLocalArchives

@Deprecated
public static void addLocalArchives(Configuration conf,
                                               String str)
Deprecated. 

Add an archive that has been localized to the conf. Used by internal DistributedCache code.

Parameters:
conf - The conf to modify to contain the localized caches
str - a comma separated list of local archives

addLocalFiles

@Deprecated
public static void addLocalFiles(Configuration conf,
                                            String str)
Deprecated. 

Add a file that has been localized to the conf. Used by internal DistributedCache code.

Parameters:
conf - The conf to modify to contain the localized caches
str - a comma separated list of local files
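
As a rough illustration of how an "add" on a comma-separated configuration value differs from a "set", the sketch below uses a plain Map as a stand-in for Configuration. The append-versus-overwrite behaviour is an assumption inferred from the method names, and the key name and the class LocalCacheListSketch are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; a Map stands in for Hadoop's Configuration.
class LocalCacheListSketch {

    // Hypothetical key, used only in this sketch.
    static final String KEY = "example.cache.localFiles";

    // "set" replaces whatever list was stored before.
    static void setLocal(Map<String, String> conf, String str) {
        conf.put(KEY, str);
    }

    // "add" appends to the stored comma-separated list (assumed behaviour).
    static void addLocal(Map<String, String> conf, String str) {
        String existing = conf.get(KEY);
        conf.put(KEY, existing == null ? str : existing + "," + str);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addLocal(conf, "/tmp/cache/lookup.dat");
        addLocal(conf, "/tmp/cache/other.dat");
        System.out.println(conf.get(KEY)); // both paths, comma separated
        setLocal(conf, "/tmp/cache/only.dat");
        System.out.println(conf.get(KEY)); // only the last path
    }
}
```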

createAllSymlink

@Deprecated
public static void createAllSymlink(Configuration conf,
                                               File jobCacheDir,
                                               File workDir)
                             throws IOException
Deprecated. Internal to MapReduce framework. Use DistributedCacheManager instead.

This method creates symlinks for all files in a given directory in another directory. Currently symlinks cannot be disabled. This is a NO-OP.

Parameters:
conf - the configuration
jobCacheDir - the target directory for creating symlinks
workDir - the directory in which the symlinks are created
Throws:
IOException

getFileStatus

@Deprecated
public static FileStatus getFileStatus(Configuration conf,
                                                  URI cache)
                                throws IOException
Deprecated. 

Returns FileStatus of a given cache file on HDFS. Internal to MapReduce.

Parameters:
conf - configuration
cache - cache file
Returns:
FileStatus of a given cache file on HDFS
Throws:
IOException

getTimestamp

@Deprecated
public static long getTimestamp(Configuration conf,
                                           URI cache)
                         throws IOException
Deprecated. 

Returns mtime of a given cache file on HDFS. Internal to MapReduce.

Parameters:
conf - configuration
cache - cache file
Returns:
mtime of a given cache file on HDFS
Throws:
IOException

setArchiveTimestamps

@Deprecated
public static void setArchiveTimestamps(Configuration conf,
                                                   String timestamps)
Deprecated. 

Stores the timestamps of the archives to be localized in the conf so that they can be checked. Used by internal MapReduce code.

Parameters:
conf - Configuration which stores the timestamps
timestamps - comma-separated list of timestamps of archives. The order should be the same as the order in which the archives are added.
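
The ordering requirement on the timestamps string can be sketched as follows: the i-th timestamp in the list belongs to the i-th archive that was added. This is a standalone JDK-only illustration, not framework code; the class TimestampListSketch, its methods, and the sample values are hypothetical.

```java
import java.util.StringJoiner;

// Illustrative sketch only; NOT part of the Hadoop API.
class TimestampListSketch {

    // Build the comma-separated string, preserving the input order.
    static String joinTimestamps(long[] timestamps) {
        StringJoiner joiner = new StringJoiner(",");
        for (long ts : timestamps) {
            joiner.add(Long.toString(ts));
        }
        return joiner.toString();
    }

    // Parse the string back; position i corresponds to the i-th archive.
    static long[] parseTimestamps(String csv) {
        if (csv.isEmpty()) {
            return new long[0];
        }
        String[] parts = csv.split(",");
        long[] result = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = Long.parseLong(parts[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Archives in the order they were added, with made-up mtimes.
        String[] archives = {"/myapp/map.zip", "/myapp/mytar.tar"};
        long[] mtimes = {1400000000000L, 1400000005000L};

        String csv = joinTimestamps(mtimes);
        long[] parsed = parseTimestamps(csv);
        for (int i = 0; i < archives.length; i++) {
            System.out.println(archives[i] + " -> " + parsed[i]);
        }
    }
}
```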

setFileTimestamps

@Deprecated
public static void setFileTimestamps(Configuration conf,
                                                String timestamps)
Deprecated. 

Stores the timestamps of the files to be localized in the conf so that they can be checked. Used by internal MapReduce code.

Parameters:
conf - Configuration which stores the timestamps
timestamps - comma-separated list of timestamps of files. The order should be the same as the order in which the files are added.

setLocalArchives

@Deprecated
public static void setLocalArchives(Configuration conf,
                                               String str)
Deprecated. 

Set the conf to contain the location for localized archives. Used by internal DistributedCache code.

Parameters:
conf - The conf to modify to contain the localized caches
str - a comma separated list of local archives

setLocalFiles

@Deprecated
public static void setLocalFiles(Configuration conf,
                                            String str)
Deprecated. 

Set the conf to contain the location for localized files. Used by internal DistributedCache code.

Parameters:
conf - The conf to modify to contain the localized caches
str - a comma separated list of local files


Copyright © 2014 Apache Software Foundation. All Rights Reserved.