This guide provides an overview of the HDFS Observer NameNode feature and how to configure/install it in a typical HA-enabled cluster. For a detailed technical design overview, please check the doc attached to HDFS-12943.
In an HA-enabled HDFS cluster (for more information, check HDFSHighAvailabilityWithQJM), there is a single Active NameNode and one or more Standby NameNodes. The Active NameNode is responsible for serving all client requests, while the Standby NameNodes simply keep up-to-date namespace information, by tailing edit logs from the JournalNodes, as well as block location information, by receiving block reports from all the DataNodes. One drawback of this architecture is that the Active NameNode can become a single bottleneck and be overloaded with client requests, especially in a busy cluster.
The Consistent Reads from HDFS Observer NameNode feature addresses the above by introducing a new type of NameNode called the Observer NameNode. Similar to a Standby NameNode, an Observer NameNode keeps itself up to date regarding the namespace and block location information. In addition, it can also serve consistent reads, like the Active NameNode. Since read requests form the majority of traffic in a typical environment, this helps load-balance NameNode traffic and improve overall throughput.
In the new architecture, an HA cluster can consist of NameNodes in three different states: active, standby, and observer. State transitions can happen between active and standby, and between standby and observer, but not directly between active and observer.
To ensure read-after-write consistency within a single client, a state ID, which is implemented using transaction ID within NameNode, is introduced in RPC headers. When a client performs write through Active NameNode, it updates its state ID using the latest transaction ID from the NameNode. When performing a subsequent read, the client passes this state ID to Observer NameNode, which will then check against its own transaction ID, and will ensure its own transaction ID has caught up with the request’s state ID, before serving the read request. This ensures “read your own writes” semantics from a single client. Maintaining consistency between multiple clients in the face of out-of-band communication is discussed in the “Maintaining Client Consistency” section below.
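The state-ID handshake described above can be illustrated with a small, self-contained sketch. This is plain Python, not Hadoop code; all class and method names here are hypothetical illustrations of the protocol, not actual HDFS APIs:

```python
# Minimal simulation of the state-ID mechanism: the client remembers the
# latest transaction ID it saw from the Active, and an Observer only
# serves a read once it has caught up to that transaction ID.

class ActiveNameNode:
    def __init__(self):
        self.txid = 0  # latest applied transaction ID

    def write(self):
        self.txid += 1
        return self.txid  # carried back to the client in the RPC header


class ObserverNameNode:
    def __init__(self):
        self.txid = 0  # transactions applied so far via edit tailing

    def tail_edits(self, active):
        self.txid = active.txid  # catch up with the Active's edits

    def read(self, client_state_id):
        # Serve the read only once this Observer has caught up with the
        # client's last-seen state ID ("read your own writes").
        if self.txid < client_state_id:
            raise RuntimeError("observer behind; request must wait")
        return self.txid


class Client:
    def __init__(self):
        self.state_id = 0

    def write(self, active):
        self.state_id = active.write()  # remember the latest txid

    def read(self, observer):
        return observer.read(self.state_id)


active, observer, client = ActiveNameNode(), ObserverNameNode(), Client()
client.write(active)               # client's state ID is now 1
try:
    client.read(observer)          # observer hasn't tailed edits yet
except RuntimeError:
    observer.tail_edits(active)    # edits arrive via tailing
print(client.read(observer))       # prints 1
```

In the real system the client does not retry in a loop like this; the Observer holds the request in its RPC queue until it has tailed far enough, which is why a low edit-tailing latency matters (see the next paragraph).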
Edit log tailing is critical for the Observer NameNode as it directly affects the latency between when a transaction is applied on the Active NameNode and when it is applied on the Observer NameNode. A new edit log tailing mechanism, named “Edit Tailing Fast-Path”, is introduced to significantly reduce this latency. It is built on top of the existing in-progress edit log tailing feature, with further improvements such as RPC-based tailing instead of HTTP, an in-memory cache on the JournalNode, etc. For more details, please see the design doc attached to HDFS-13150.
New client-side proxy providers are also introduced. ObserverReadProxyProvider, which inherits from the existing ConfiguredFailoverProxyProvider, should be used in place of the latter to enable reads from Observer NameNodes. When submitting a client read request, the proxy provider will first try each Observer NameNode available in the cluster, and only fall back to the Active NameNode if all of them fail. Similarly, ObserverReadProxyProviderWithIPFailover is introduced to replace IPFailoverProxyProvider in an IP failover setup.
As discussed above, a client ‘foo’ will update its state ID upon every request to the Active NameNode, which includes all write operations. Any request directed to an Observer NameNode will wait until the Observer has seen this transaction ID, ensuring that the client is able to read all of its own writes. However, if ‘foo’ sends an out-of-band (i.e., non-HDFS) message to client ‘bar’ telling it that a write has been performed, a subsequent read by ‘bar’ may not see the recent write by ‘foo’. To prevent this inconsistent behavior, a new msync(), or “metadata sync”, command has been added. When msync() is called on a client, it will update its state ID against the Active NameNode – a very lightweight operation – so that subsequent reads are guaranteed to be consistent up to the point of the msync(). Thus, as long as ‘bar’ calls msync() before performing its read, it is guaranteed to see the write made by ‘foo’.
To make use of msync(), an application does not necessarily have to make any code changes. Upon startup, a client will automatically call msync() before performing any reads against an Observer, so that any writes performed prior to the initialization of the client will be visible. In addition, ObserverReadProxyProvider supports a configurable “auto-msync” mode, which automatically performs an msync() at some configurable interval, to prevent a client from ever seeing data that is more stale than a time bound. There is some overhead associated with this, as each refresh requires an RPC to the Active NameNode, so it is disabled by default.
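The two-client scenario above can be condensed into a short sketch. Again this is plain Python with hypothetical names, only modeling the state-ID bookkeeping, not the real Hadoop classes:

```python
# Why 'bar' needs msync(): without refreshing its state ID from the
# Active, a stale Observer is allowed to serve bar's read, which may
# miss foo's recent write.

class Active:
    def __init__(self):
        self.txid = 0

    def write(self):
        self.txid += 1
        return self.txid

    def msync(self):
        # Lightweight: just hand the caller the latest transaction ID.
        return self.txid


class Observer:
    def __init__(self):
        self.txid = 0  # has not tailed any edits yet

    def can_serve(self, state_id):
        return self.txid >= state_id


active, observer = Active(), Observer()
foo_state = active.write()   # foo performs a write; its state ID is 1
bar_state = 0                # bar has never contacted the Active

# Without msync(): the stale Observer happily serves bar's read,
# which would not include foo's write.
assert observer.can_serve(bar_state)

# With msync(): bar refreshes its state ID from the Active first, so
# the stale Observer must catch up before it may serve the read.
bar_state = active.msync()
assert not observer.can_serve(bar_state)
```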
To enable consistent reads from Observer NameNode, you’ll need to add a few configurations to your hdfs-site.xml:
This causes the NameNode to create an alignment context instance, which keeps track of the current server state ID; the state ID is then carried back to clients in RPC responses. It is disabled by default to avoid this overhead when Observer reads are not in use, but it is required to be turned on for the Observer NameNode feature.
<property>
  <name>dfs.namenode.state.context.enabled</name>
  <value>true</value>
</property>
This enables fast edit log tailing through in-progress edit logs and also other mechanisms such as RPC-based edit log fetching, in-memory cache in JournalNodes, and so on. It is disabled by default, but is required to be turned on for the Observer NameNode feature.
<property>
  <name>dfs.ha.tail-edits.in-progress</name>
  <value>true</value>
</property>
This determines the staleness of the Observer NameNode with respect to the Active. If set too large, RPC time will increase, as client requests will wait longer in the RPC queue before the Observer tails edit logs and catches up with the latest state of the Active. The default value is 1 minute. It is highly recommended to configure this to a much lower value. It is also recommended to enable backoff when using low values; please see below.
<property>
  <name>dfs.ha.tail-edits.period</name>
  <value>0ms</value>
</property>
This determines the behavior of a Standby/Observer when it attempts to tail edits from the JournalNodes and finds no edits available. This is a common situation when the edit tailing period is very low, but the cluster is not heavily loaded. Without this configuration, such a situation will cause high utilization on the Standby/Observer as it constantly attempts to read edits even though there are none available. With this configuration enabled, exponential backoff will be performed when an edit tail attempt returns 0 edits. This configuration specifies the maximum time to wait between edit tailing attempts.
<property>
  <name>dfs.ha.tail-edits.period.backoff-max</name>
  <value>10s</value>
</property>
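The interaction between the base tailing period and the backoff cap can be sketched as follows. This is a hedged illustration of exponential backoff in general, not the exact logic of Hadoop’s EditLogTailer, and the values are just the ones configured above:

```python
# Sketch of exponential backoff for edit tailing: when a tail attempt
# returns no edits, the wait doubles up to backoff-max; when edits are
# found, the wait resets to the base period.
# (Illustrative only; the real tailer implementation may differ.)

PERIOD_MS = 0            # dfs.ha.tail-edits.period
BACKOFF_MAX_MS = 10_000  # dfs.ha.tail-edits.period.backoff-max

def next_wait(current_wait_ms, edits_tailed):
    if edits_tailed > 0:
        return PERIOD_MS                  # edits found: reset to base period
    # No edits: back off exponentially, starting from at least 1 ms.
    doubled = max(1, current_wait_ms) * 2
    return min(doubled, BACKOFF_MAX_MS)

wait, waits = PERIOD_MS, []
for edits in [0, 0, 0, 0, 5, 0]:  # simulated results of tail attempts
    wait = next_wait(wait, edits)
    waits.append(wait)
print(waits)  # [2, 4, 8, 16, 0, 2]
```

Without the backoff cap, an idle cluster with a 0 ms tailing period would spin constantly against the JournalNodes; with it, the wait grows toward 10 s between empty attempts and snaps back as soon as new edits appear.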
This is the size, in bytes, of the in-memory cache for storing edits on the JournalNode side. The cache is used for serving edits via RPC-based tailing. This is only effective when dfs.ha.tail-edits.in-progress is turned on.
<property>
  <name>dfs.journalnode.edit-cache-size.bytes</name>
  <value>1048576</value>
</property>
Used to calculate the size of the edits cache kept in the JournalNode’s memory, as a fraction of the JournalNode’s maximum heap. This config is an alternative to dfs.journalnode.edit-cache-size.bytes; like it, the cache serves edits for tailing via the RPC-based mechanism, and it only takes effect when dfs.ha.tail-edits.in-progress is true. Transactions range in size but average around 200 bytes, so the 1 MB default of dfs.journalnode.edit-cache-size.bytes can store around 5000 transactions; with this config, a reasonable cache size can be derived from the maximum heap instead. The recommended value is less than 0.9. If dfs.journalnode.edit-cache-size.bytes is set, this parameter does not take effect.
<property>
  <name>dfs.journalnode.edit-cache-size.fraction</name>
  <value>0.5f</value>
</property>
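The sizing arithmetic above can be checked with a few lines of back-of-the-envelope Python, using the ~200-byte average transaction size quoted in the description. The heap size and fraction below are illustrative values, not defaults:

```python
# Rough capacity math for the JournalNode edit cache.
AVG_TXN_BYTES = 200  # approximate average transaction size

def cache_capacity(cache_bytes):
    """Approximate number of transactions the cache can hold."""
    return cache_bytes // AVG_TXN_BYTES

def cache_bytes_from_fraction(max_heap_bytes, fraction):
    """Cache size when configured via dfs.journalnode.edit-cache-size.fraction."""
    return int(max_heap_bytes * fraction)

# The 1 MB default of dfs.journalnode.edit-cache-size.bytes holds ~5000 txns:
print(cache_capacity(1048576))  # 5242

# A hypothetical 1 GB JournalNode heap with fraction 0.5 gives a 512 MB cache:
print(cache_capacity(cache_bytes_from_fraction(1 << 30, 0.5)))
```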
It is highly recommended to disable this configuration. If enabled, it will turn a getBlockLocations call into a write call, since the NameNode needs to hold the write lock to update the access time of the opened file. As a result, such requests will fail on all Observer NameNodes and eventually fall back to the Active, degrading RPC performance.
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>0</value>
</property>
A new HA admin command is introduced to transition a Standby NameNode into observer state:
haadmin -transitionToObserver
Note that this can only be executed on a Standby NameNode; an exception will be thrown if it is invoked on an Active NameNode.
Similarly, the existing transitionToStandby command can also be run on an Observer NameNode, which transitions it to the standby state.
NOTE: the ability for an Observer NameNode to participate in failover is not implemented yet. Therefore, as described in the next section, you should only use transitionToObserver to bring up an observer. ZKFC can be run on the Observer NameNode, but it does nothing while the NameNode is in observer state; ZKFC will only participate in the election of the Active after the NameNode has been transitioned back to the standby state.
To enable observer support, first you’ll need an HA-enabled HDFS cluster with more than two NameNodes. Then, you need to transition Standby NameNode(s) into the observer state. A minimal setup would run three NameNodes in the cluster: one active, one standby, and one observer. For large HDFS clusters we recommend running two or more Observers, depending on the intensity of read requests and HA requirements.
Note that currently the Observer NameNode doesn’t integrate fully with automatic failover. If dfs.ha.automatic-failover.enabled is turned on, the only benefit of running ZKFC on the Observer NameNode is that it will automatically join the election of the Active after you transition the NameNode to standby. If this is not desired, you can disable ZKFC on the Observer NameNode. In addition, you’ll also need to add the forcemanual flag to the transitionToObserver command:
haadmin -transitionToObserver -forcemanual
In the future, this restriction will be lifted.
Clients who wish to use Observer NameNode for read accesses can specify the ObserverReadProxyProvider class for proxy provider implementation, in the client-side hdfs-site.xml configuration file:
<property>
  <name>dfs.client.failover.proxy.provider.<nameservice></name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider</value>
</property>
Clients who do not wish to use Observer NameNode can still use the existing ConfiguredFailoverProxyProvider and should not see any behavior change.
Clients who wish to make use of the “auto-msync” functionality should adjust the configuration below. It specifies a time period after which, if the client’s state ID has not been updated from the Active NameNode, an msync() will automatically be performed. If this is specified as 0, an msync() will be performed before every read operation. If this is a positive time duration, an msync() will be performed every time a read operation is requested and the Active has not been contacted for longer than that period. If this is negative (the default), no automatic msync() will be performed.
<property>
  <name>dfs.client.failover.observer.auto-msync-period.<nameservice></name>
  <value>500ms</value>
</property>