MapReduce support for the YARN shared cache allows MapReduce jobs to take advantage of additional resource caching. This saves network bandwidth between the job submission client as well as within the YARN cluster itself. This will reduce job submission time and overall job runtime.
First, your YARN cluster must have the shared cache service running. Please see YARN documentation for information on how to setup the shared cache service.
A MapReduce user can specify what resources are eligible to be uploaded to the shared cache based on resource type. This is done using a configuration parameter in mapred-site.xml:
<property> <name>mapreduce.job.sharedcache.mode</name> <value>disabled</value> <description> A comma delimited list of resource categories to submit to the shared cache. The valid categories are: jobjar, libjars, files, archives. If "disabled" is specified then the job submission code will not use the shared cache. </description> </property>
If a resource type is listed, it will check the shared cache to see if the resource is already in the cache. If so, it will use the cached resource, if not, it will specify that the resource needs to be uploaded asynchronously.
A MapReduce user has 3 ways to specify resources for a MapReduce job:
It is important to ensure that each resource for a MapReduce job has a unique file name. This prevents symlink clobbering when YARN containers running MapReduce tasks are localized during container launch. A user can specify their own resource name by using the fragment portion of a URI. For example, for file resources specified on the command line, it could look like this:
-files /local/path/file1.txt#foo.txt,/local/path2/file1.txt#bar.txt
In the above example two files, named file1.txt, will be localized with two different names: foo.txt and bar.txt.
All resources in the shared cache have a PUBLIC visibility.
In the event that the shared cache manager is unavailable, the MapReduce client uses a fail-fast mechanism. If the MapReduce client fails to contact the shared cache manager, the client will no longer use the shared cache for the rest of that job submission. This prevents the MapReduce client from timing out each time it tries to check for a resource in the shared cache. The MapReduce client quickly reverts to the default behavior and submits a Job as if the shared cache was never enabled in the first place.