YARN Application Security

Anyone writing a YARN application needs to understand the process, in order to write short-lived applications or long-lived services. They also need to start testing on secure clusters during early development stages, in order to write code that actually works.

How YARN Security works

YARN Resource Managers (RMs) and Node Managers (NMs) co-operate to execute the user’s application with the identity and hence access rights of that user.

The (active) Resource Manager:

  1. Finds space in a cluster to deploy the core of the application, the Application Master (AM).

  2. Requests that the NM on that node allocate a container and start the AM in it.

  3. Communicates with the AM, so that the AM can request new containers and manipulate/release current ones, and to provide notifications about allocated and running containers.

The Node Managers:

  1. Localize resources: Download from HDFS or other filesystem into a local directory. This is done using the delegation tokens attached to the container launch context. (For non-HDFS resources, using other credentials such as object store login details in cluster configuration files)

  2. Start the application as the user.

  3. Monitor the application and report failure to the RM.

To execute code in the cluster, a YARN application must:

  1. Have a client-side application which sets up the ApplicationSubmissionContext detailing what is to be launched. This includes:

    • A list of files in the cluster’s filesystem to be “localized”.
    • The environment variables to set in the container.
    • The commands to execute in the container to start the application.
    • Any security credentials needed by YARN to launch the application.
    • Any security credentials needed by the application to interact with any Hadoop cluster services and applications.
  2. Have an Application Master which, when launched, registers with the YARN RM and listens for events. Any AM which wishes to execute work in other containers must request them off the RM, and, when allocated, create a ContainerLaunchContext containing the command to execute, the environment to execute the command, binaries to localize and all relevant security credentials.

  3. Even with the NM handling the localization process, the AM must itself be able to retrieve the security credentials supplied at launch time so that it itself may work with HDFS and any other services, and to pass some or all of these credentials down to the launched containers.

Acquiring and Adding tokens to a YARN Application

The delegation tokens which a YARN application needs must be acquired from a program executing as an authenticated user. For a YARN application, this means the user launching the application. It is the client-side part of the YARN application which must do this:

  1. Log in via UserGroupInformation.
  2. Identify all tokens which must be acquired.
  3. Request these tokens from the specific Hadoop services.
  4. Marshall all tokens into a byte buffer.
  5. Add them to the ContainerLaunchContext within the ApplicationSubmissionContext.

Which tokens are required? Normally, at least a token to access HDFS.

An application must request a delegation token from every filesystem with which it intends to interact —including the cluster’s main FS. FileSystem.addDelegationTokens(renewer, credentials) can be used to collect these; it is a no-op on those filesystems which do not issue tokens (including non-kerberized HDFS clusters).

Applications talking to other services, such as Apache HBase and Apache Hive, must request tokens from these services, using the libraries of these services to acquire delegation tokens. All tokens can be added to the same set of credentials, then saved to a byte buffer for submission.

The Application Timeline Server also needs a delegation token. This is handled automatically on AM launch.

Extracting tokens within the AM

When the Application Master is launched and any of the UGI/Hadoop operations which trigger a user login invoked, the UGI class will automatically load in all tokens saved in the file named by the environment variable HADOOP_TOKEN_FILE_LOCATION.

This happens on an insecure cluster along with a secure one, and on a secure cluster even if a keytab is used by the application. Why? Because the AM/RM token needed to authenticate the application with the YARN RM is always supplied this way.

This means you have a relative similar workflow across secure and insecure clusters.

  1. Suring AM startup, log in to Kerberos. A call to UserGroupInformation.isSecurityEnabled() will trigger this operation.

  2. Enumerate the current user’s credentials, through a call of UserGroupInformation.getCurrentUser().getCredentials().

  3. Filter out the AMRM token, resulting in a new set of credentials. In an insecure cluster, the list of credentials will now be empty; in a secure cluster they will contain

  4. Set the credentials of all containers to be launched to this (possibly empty) list of credentials.

  5. If the filtered list of tokens to renew, is non-empty start up a thread to renew them.

Token Renewal

Tokens expire: they have a limited lifespan. An application wishing to use a token past this expiry date must renew the token before the token expires.

Hadoop automatically sets up a delegation token renewal thread when needed, the DelegationTokenRenewer.

It is the responsibility of the application to renew all tokens other than the AMRM and timeline tokens.

Here are the different strategies

  1. Don’t. Rely on the lifespan of the application being so short that token renewal is not needed. For applications whose life can always be measured in minutes or tens of minutes, this is a viable strategy.

  2. Start a background thread/Executor to renew the tokens at a regular interval. This what most YARN applications do.

Other Aspects of YARN Security

AM/RM Token Refresh

The AM/RM token is renewed automatically; the AM pushes out a new token to the AM within an allocate message. Consult the AMRMClientImpl class to see the process. Your AM code does not need to worry about this process

Token Renewal on AM Restart

Even if an application is renewing tokens regularly, if an AM fails and is restarted, it gets restarted from that original ApplicationSubmissionContext. The tokens there may have expired, so localization may fail, even before the issue of credentials to talk to other services.

How is this problem addressed? The YARN Resource Manager gets a new token for the node managers, if needed.

More precisely

  1. The token passed by the RM to the NM for localization is refreshed/updated as needed.
  2. Tokens in the app launch context for use by the application are not refreshed. That is, if it has an out of date HDFS token —that token is not renewed. This also holds for tokens for for Hive, HBase, etc.
  3. Therefore, to survive AM restart after token expiry, your AM has to get the NMs to localize the keytab or make no HDFS accesses until (somehow) a new token has been passed to them from a client.

This is primarily an issue for long-lived services (see below).

Unmanaged Application Masters

Unmanaged application masters are not launched in a container set up by the RM and NM, so cannot automatically pick up an AM/RM token at launch time. The YarnClient.getAMRMToken() API permits an Unmanaged AM to request an AM/RM token. Consult UnmanagedAMLauncher for the specifics.

Identity on an insecure cluster: HADOOP_USER_NAME

In an insecure cluster, the application will run as the identity of the account of the node manager, typically something such as yarn or mapred. By default, the application will access HDFS as that user, with a different home directory, and with a different user identified in audit logs and on file system owner attributes.

This can be avoided by having the client identify the identify of the HDFS/Hadoop user under which the application is expected to run. This does not affect the OS-level user or the application’s access rights to the local machine.

When Kerberos is disabled, the identity of a user is picked up by Hadoop first from the environment variable HADOOP_USER_NAME, then from the OS-level username (e.g. the system property user.name).

YARN applications should propagate the user name of the user launching an application by setting this environment variable.

Map<String, String> env = new HashMap<>();
String userName = UserGroupInformation.getCurrentUser().getUserName();
env.put(UserGroupInformation.HADOOP_USER_NAME, userName);
containerLaunchContext.setEnvironment(env);

Note that this environment variable is picked up in all applications which talk to HDFS via the hadoop libraries. That is, if set, it is the identity picked up by HBase and other applications executed within the environment of a YARN container within which this environment variable is set.

Oozie integration and HADOOP_TOKEN_FILE_LOCATION

Apache Oozie can launch an application in a secure cluster either by acquiring all relevant credentials, saving them to a file in the local filesystem, then setting the path to this file in the environment variable HADOOP_TOKEN_FILE_LOCATION. This is of course the same environment variable passed down by YARN in launched containers, as is similar content: a byte array with credentials.

Here, however, the environment variable is set in the environment executing the YARN client. This client must use the token information saved in the named file instead of acquiring any tokens of its own.

Loading in the token file is automatic: UGI does it during user login.

The client is then responsible for passing the same credentials into the AM launch context. This can be done simply by passing down the current credentials.

credentials = new Credentials(
    UserGroupInformation.getCurrentUser().getCredentials());

Timeline Server integration

The Application Timeline Server can be deployed as a secure service —in which case the application will need the relevant token to authenticate with it. This process is handled automatically in YarnClientImpl if ATS is enabled in a secure cluster. Similarly, the AM-side TimelineClient YARN service class manages token renewal automatically via the ATS’s SPNEGO-authenticated REST API.

If you need to prepare a set of delegation tokens for a YARN application launch via Oozie, this can be done via the timeline client API.

try(TimelineClient timelineClient = TimelineClient.createTimelineClient()) {
  timelineClient.init(conf);
  timelineClient.start();
  Token<TimelineDelegationTokenIdentifier> token =
      timelineClient.getDelegationToken(rmprincipal));
  credentials.addToken(token.getService(), token);
}

Cancelling Tokens

Applications may wish to cancel tokens they hold when terminating their AM. This ensures that the tokens are no-longer valid.

This is not mandatory, and as a clean shutdown of a YARN application cannot be guaranteed, it is not possible to guarantee that the tokens will always be during application termination. However, it does reduce the window of vulnerability to stolen tokens.

Securing Long-lived YARN Services

There is a time limit on all token renewals, after which tokens won’t renew, causing the application to stop working. This is somewhere between seventy-two hours and seven days.

Any YARN service intended to run for an extended period of time must have a strategy for renewing credentials.

Here are the strategies:

Pre-installed Keytabs for AM and containers

A keytab is provided for the application’s use on every node.

This is done by:

  1. Installing it in every cluster node’s local filesystem.
  2. Providing the path to this in a configuration option.
  3. The application loading the credentials via UserGroupInformation.loginUserFromKeytab().

The keytab must be in a secure directory path, where only the service (and other trusted accounts) can read it. Distribution becomes a responsibility of the cluster operations team.

This is effectively how all static Hadoop applications get their security credentials.

Keytabs for AM and containers distributed via YARN

  1. A keytab is uploaded to HDFS.

  2. When launching the AM, the keytab is listed as a resource to localize to the AM’s container.

  3. The Application Master is configured with the relative path to the keytab, and logs in with UserGroupInformation.loginUserFromKeytab().

  4. When the AM launches the container, it lists the HDFS path to the keytab as a resource to localize.

  5. It adds the HDFS delegation token to the container launch context, so that the keytab and other application files can be localized.

  6. Launched containers must themselves log in via UserGroupInformation.loginUserFromKeytab(). UGI handles the login, and schedules a background thread to relogin the user periodically.

  7. Token creation is handled automatically in the Hadoop IPC and REST APIs, the containers stay logged in via kerberos for their entire duration.

This avoids the administration task of installing keytabs for specific services across the entire cluster.

It does require the client to have access to the keytab and, as it is uploaded to the distributed filesystem, must be secured through the appropriate path permissions/ACLs.

As all containers have access to the keytab, all code executing in the containers has to be trusted. Malicious code (or code escaping some form of sandbox) could read the keytab, and hence have access to the cluster until the keys expire or are revoked.

This is the strategy implemented by Apache Slider (incubating).

AM keytab distributed via YARN; AM regenerates delegation tokens for containers.

  1. A keytab is uploaded to HDFS by the client.

  2. When launching the AM, the keytab is listed as a resource to localize to the AM’s container.

  3. The Application Master is configured with the relative path to the keytab, and logs in with UserGroupInformation.loginUserFromKeytab(). The UGI codepath will still automatically load the file references by $HADOOP_TOKEN_FILE_LOCATION, which is how the AMRM token is picked up.

  4. When the AM launches a container, it acquires all the delegation tokens needed by that container, and adds them to the container’s container launch context.

  5. Launched containers must load the delegation tokens from $HADOOP_TOKEN_FILE_LOCATION, and use them (including renewals) until they can no longer be renewed.

  6. The AM must implement an IPC interface which permits containers to request a new set of delegation tokens; this interface must itself use authentication and ideally wire encryption.

  7. Before a delegation token is due to expire, the processes running in the containers must request new tokens from the Application Master over the IPC channel.

  8. When the containers need the new tokens, the AM, logged in with a keytab, asks the various cluster services for new tokens.

(Note there is an alternative direction for refresh operations: from AM to the containers, again over whatever IPC channel is implemented between AM and containers). The rest of the algorithm: AM regenerated tokens passed to containers over IPC.

This is the strategy used by Apache Spark 1.5+, with a netty-based protocol between containers and the AM for token updates.

Because only the AM has direct access to the keytab, it is less exposed. Code running in the containers only has access to the delegation tokens.

However, those containers will have access to HDFS from the tokens passed in at container launch, so will have access to the copy of the keytab used for launching the AM. While the AM could delete that keytab on launch, doing so would stop YARN being able to successfully relaunch the AM after any failure.

Client-side Token Push

This strategy may be the sole one acceptable to a strict operations team: a client process running on an account holding a Kerberos TGT negotiates with all needed cluster services for new delegation tokens, tokens which are then pushed out to the Application Master via some RPC interface.

This does require the client process to be re-executed on a regular basis; a cron or Oozie job can do this. The AM will need to implement an IPC API over which renewed tokens can be provided. (Note that as Oozie can collect the tokens itself, all the updater application needs to do whenever executed is set up an IPC connection with the AM and pass up the current user’s credentials).

Securing YARN Application Web UIs and REST APIs

YARN provides a straightforward way of giving every YARN application SPNEGO authenticated web pages: it implements SPNEGO authentication in the Resource Manager Proxy.

YARN web UI are expected to load the AM proxy filter when setting up its web UI; this filter will redirect all HTTP Requests coming from any host other than the RM Proxy hosts to an RM proxy, to which the client app/browser must re-issue the request. The client will authenticate against the principal of the RM Proxy (usually yarn), and, once authenticated, have its request forwared.

As a result, all client interactions are SPNEGO-authenticated, without the YARN application itself needing any kerberos principal for the clients to authenticate against.

Known weaknesses in this approach are:

  1. As calls coming from the proxy hosts are not redirected, any application running on those hosts has unrestricted access to the YARN applications. This is why in a secure cluster the proxy hosts must run on cluster nodes which do not run end user code (i.e. not run YARN NodeManagers and hence schedule YARN containers, nor support logins by end users).

  2. The HTTP requests between proxy and YARN RM Server are not currently encrypted. That is: HTTPS is not supported.

Securing YARN Application REST APIs

YARN REST APIs running on the same port as the registered web UI of a YARN application are automatically authenticated via SPNEGO authentication in the RM proxy.

Any REST endpoint (and equally, any web UI) brought up on a different port does not support SPNEGO authentication unless implemented in the YARN application itself.

Checklist for YARN Applications

Here is the checklist of core actions which a YARN application must do to successfully launch in a YARN cluster.

Client

[ ] Client checks for security being enabled via UserGroupInformation.isSecurityEnabled()

In a secure cluster:

[ ] If HADOOP_TOKEN_FILE_LOCATION is unset, client acquires delegation tokens for the local filesystems, with the RM principal set as the renewer.

[ ] If HADOOP_TOKEN_FILE_LOCATION is unset, client acquires delegation tokens for all other services to be used in the YARN application.

[ ] If HADOOP_TOKEN_FILE_LOCATION is set, client uses the current user’s credentials as the source of all tokens to be added to the container launch context.

[ ] Client sets all tokens on AM ContainerLaunchContext.setTokens().

[ ] Recommended: if it is set in the client’s environment, client sets the environment variable HADOOP_JAAS_DEBUG=true in the Container launch context of the AM.

In an insecure cluster:

[ ] Propagate local username to YARN AM, hence HDFS identity via the HADOOP_USER_NAME environment variable.

App Master

[ ] In a secure cluster, AM retrieves security tokens from HADOOP_TOKEN_FILE_LOCATION environment variable (automatically done by UGI).

[ ] A copy the token set is filtered to remove the AM/RM token and any timeline token.

[ ] A thread or executor is started to renew threads on a regular basis.

[ ] Recommended: AM cancels tokens when application completes.

Container Launch by AM

[ ] Tokens to be passed to containers are passed via ContainerLaunchContext.setTokens().

[ ] In an insecure cluster, propagate the HADOOP_USER_NAME environment variable.

[ ] Recommended: AM sets the environment variable HADOOP_JAAS_DEBUG=true in the Container launch context if it is set in the AM’s environment.

Launched Containers

[ ] Call UserGroupInformation.isSecurityEnabled() to trigger security setup.

[ ] A thread or executor is started to renew threads on a regular basis.

YARN service

[ ] Application developers have chosen and implemented a token renewal strategy: shared keytab, AM keytab or client-side token refresh.

[ ] In a secure cluster, the keytab is either already in HDFS (and checked for), or it is in the local FS of the client, in which case it must be uploaded and added to the list of resources to localize.

[ ] If stored in HDFS, keytab permissions should be checked. If the keytab is readable by principals other than the current user, warn, and consider actually failing the launch (similar to the normal ssh application.)

[ ] Client acquires HDFS delegation token and and attaches to the AM Container Launch Context,

[ ] AM logs in as principal in keytab via loginUserFromKeytab().

[ ] (AM extracts AM/RM token from the HADOOP_TOKEN_FILE_LOCATION environment variable).

[ ] For launched containers, either the keytab is propagated, or the AM acquires/attaches all required delegation tokens to the Container Launch context alongside the HDFS delegation token needed by the NMs.

Testing YARN applications in a secure cluster.

It is straightforward to be confident that a YARN application works in secure cluster. The process to do so is: test on a secure cluster.

Even a single VM-cluster can be set up with security enabled. If doing so, we recommend turning security up to its strictest, with SPNEGO-authenticated Web UIs (and hence RM Proxy), as well as IPC wire encryption. Setting the kerberos token expiry to under an hour will find kerberos expiry problems early —so is also recommended.

[ ] Application launched in secure cluster.

[ ] Launched application runs as user submitting job (tip: log user.name system property in AM).

[ ] Web browser interaction verified in secure cluster.

[ ] REST client interation (GET operations) tested.

[ ] Application continues to run after Kerberos Token expiry.

[ ] Application does not launch if user lacks Kerberos credentials.

[ ] If the application supports the timeline server, verify that it publishes events in a secure cluster.

[ ] If the application integrates with other applications, such as HBase or Hive, verify that the interaction works in a secure cluster.

[ ] If the application communicates with remote HDFS clusters, verify that it can do so in a secure cluster (i.e. that the client extracted any delegation tokens for this at launch time)

Important

If you don’t test your YARN application in a secure Hadoop cluster, it won’t work.

And without those tests: your users will be the ones to find out that your application doesn’t work in a secure cluster.

Bear that in mind when considering how much development effort to put into Kerberos support.