This document captures the compatibility goals of the Apache Hadoop project. The different types of compatibility between Hadoop releases that affect Hadoop developers, downstream projects, and end-users are enumerated. For each type of compatibility this document will:
All Hadoop interfaces are classified according to the intended audience and stability in order to maintain compatibility with previous releases. See the Hadoop Interface Taxonomy for details about the classifications.
This document is intended for consumption by the Hadoop developer community. This document describes the lens through which changes to the Hadoop project should be viewed. In order for end users and third party developers to have confidence about cross-release compatibility, the developer community must ensure that development efforts adhere to these policies. It is the responsibility of the project committers to validate that all changes either maintain compatibility or are explicitly marked as incompatible.
Within a component Hadoop developers are free to use Private and Limited Private APIs, but when using components from a different module Hadoop developers should follow the same guidelines as third-party developers: do not use Private or Limited Private (unless explicitly allowed) interfaces and prefer instead Stable interfaces to Evolving or Unstable interfaces where possible. Where not possible, the preferred solution is to expand the audience of the API rather than introducing or perpetuating an exception to these compatibility guidelines. When working within a Maven module Hadoop developers should observe where possible the same level of restraint with regard to using components located in other Maven modules.
Above all, Hadoop developers must be mindful of the impact of their changes. Stable interfaces must not change between major releases. Evolving interfaces must not change between minor releases. New classes and components must be labeled appropriately for audience and stability. See the Hadoop Interface Taxonomy for details about when the various labels are appropriate. As a general rule, all new interfaces and APIs should have the most limited labels (e.g. Private Unstable) that will not inhibit the intent of the interface or API.
This document is arranged in sections according to the various compatibility concerns. Within each section an introductory text explains what compatibility means in that section, why it’s important, and what the intent to support compatibility is. The subsequent “Policy” section then sets forth in specific terms what the governing policy is.
The key words “MUST” “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are to be interpreted as described in RFC 2119.
The Java API provides a @Deprecated annotation to mark an API element as flagged for removal. The standard meaning of the annotation is that the API element should not be used and may be removed in a later version.
In all cases removing an element from an API is an incompatible change. The stability of the element SHALL determine when such a change is permissible. A Stable element MUST be marked as deprecated for a full major release before it can be removed and SHALL NOT be removed in a minor or maintenance release. An Evolving element MUST be marked as deprecated for a full minor release before it can be removed and SHALL NOT be removed during a maintenance release. An Unstable element MAY be removed at any time. When possible an Unstable element SHOULD be marked as deprecated for at least one release before being removed. For example, if a method is marked as deprecated in Hadoop 2.8, it cannot be removed until Hadoop 4.0.
Stable API elements MUST NOT be removed until they have been marked as deprecated (through the @Deprecated annotation or other appropriate documentation) for a full major release. In the case that an API element was introduced as deprecated (to indicate that it is a temporary measure that is intended to be removed) the API element MAY be removed in the following major release. When modifying a Stable API, developers SHOULD prefer introducing a new method or endpoint and deprecating the existing one to making incompatible changes to the method or endpoint.
Developers SHOULD annotate all Hadoop interfaces and classes with the @InterfaceAudience and @InterfaceStability annotations to describe the intended audience and stability.
Annotations MAY be applied at the package, class, or method level. If a method has no privacy or stability annotation, it SHALL inherit its intended audience or stability level from the class to which it belongs. If a class has no privacy or stability annotation, it SHALL inherit its intended audience or stability level from the package to which it belongs. If a package has no privacy or stability annotation, it SHALL be assumed to be Private and Unstable, respectively.
In the event that an element’s audience or stability annotation conflicts with the corresponding annotation of its parent (whether explicit or inherited), the element’s audience or stability (respectively) SHALL be determined by the more restrictive annotation. For example, if a Private method is contained in a Public class, then the method SHALL be treated as Private. If a Public method is contained in a Private class, the method SHALL be treated as Private.
The compatibility policy SHALL be determined by the relevant package, class, or member variable or method annotations.
Note: APIs generated from the proto files MUST be compatible for rolling upgrades. See the section on wire protocol compatibility for more details. The compatibility policies for APIs and wire protocols must therefore go hand in hand.
Apache Hadoop strives to ensure that the behavior of APIs remains consistent across releases, though changes for correctness may result in changes in behavior. API behavior SHALL be specified by the JavaDoc API documentation where present and complete. When JavaDoc API documentation is not available, behavior SHALL be specified by the behavior expected by the related unit tests. In cases with no JavaDoc API documentation or unit test coverage, the expected behavior is presumed to be obvious and SHOULD be assumed to be the minimum functionality implied by the interface naming. The community is in the process of specifying some APIs more rigorously and enhancing test suites to verify compliance with the specification, effectively creating a formal specification for the subset of behaviors that can be easily tested.
The behavior of any API MAY be changed to fix incorrect behavior according to the stability of the API, with such a change to be accompanied by updating existing documentation and tests and/or adding new documentation or tests.
Apache Hadoop revisions SHOULD retain binary compatability such that end-user applications continue to work without any modifications. Minor Apache Hadoop revisions within the same major revision MUST retain compatibility such that existing MapReduce applications (e.g. end-user applications and projects such as Apache Pig, Apache Hive, et al), existing YARN applications (e.g. end-user applications and projects such as Apache Spark, Apache Tez et al), and applications that accesses HDFS directly (e.g. end-user applications and projects such as Apache HBase, Apache Flume, et al) work unmodified and without recompilation when used with any Apache Hadoop cluster within the same major release as the original build target.
For MapReduce applications in particular, i.e. applications using the org.apache.hadoop.mapred and/or org.apache.hadoop.mapreduce APIs, the developer community SHALL support binary compatibility across major releases. The MapReduce APIs SHALL be supported compatibly across major releases. See Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x for more details.
Some applications may be affected by changes to disk layouts or other internal changes. See the sections that follow for policies on how incompatible changes to non-API interfaces are handled.
Hadoop includes several native components, including compression, the container executor binary, and various native integrations. These native components introduce a set of native dependencies for Hadoop, both at compile time and at runtime, such as cmake, gcc, zlib, etc. This set of native dependencies is part of the Hadoop ABI.
The minimum required versions of the native components on which Hadoop depends at compile time and/or runtime SHALL be considered Evolving. The minimum required versions SHOULD NOT increase between minor releases within a major version, though updates because of security issues, license issues, or other reasons MAY occur. When the native components on which Hadoop depends must be updated between minor releases within a major release, where possible the changes SHOULD only change the minor versions of the components without changing the major versions.
Wire compatibility concerns data being transmitted “over the wire” between Hadoop processes. Hadoop uses Protocol Buffers for most RPC communication. Preserving compatibility requires prohibiting modification as described below. Non-RPC communication should be considered as well, for example using HTTP to transfer an HDFS image as part of snapshotting or transferring MapReduce map task output. The communications can be categorized as follows:
The components of Apache Hadoop may have dependencies that include their own protocols, such as Zookeeper, S3, Kerberos, etc. These protocol dependencies SHALL be treated as internal protocols and governed by the same policy.
In addition to compatibility of the protocols themselves, maintaining cross-version communications requires that the transports supported also be stable. The most likely source of transport changes stems from secure transports, such as SSL. Upgrading a service from SSLv2 to SSLv3 may break existing SSLv2 clients. The minimum supported major version of any transports SHOULD NOT increase between minor releases within a major version, though updates because of security issues, license issues, or other reasons MAY occur. When a transport must be updated between minor releases within a major release, where possible the changes SHOULD only change the minor versions of the components without changing the major versions.
Service ports are considered as part of the transport mechanism. Default service port numbers must be kept consistent to prevent breaking clients.
Hadoop wire protocols are defined in .proto (ProtocolBuffers) files. Client-Server and Server-Server protocols SHALL be classified according to the audience and stability classifications noted in their .proto files. In cases where no classifications are present, the protocols SHOULD be assumed to be Private and Stable.
The following changes to a .proto file SHALL be considered compatible:
The following changes to a .proto file SHALL be considered incompatible:
The following changes to a .proto file SHALL be considered incompatible:
Hadoop wire protocols that are not defined via .proto files SHOULD be considered to be Private and Stable.
In addition to the limitations imposed by being Stable, Hadoop’s wire protocols MUST also be forward compatible across minor releases within a major version according to the following:
New transport mechanisms MUST only be introduced with minor or major version changes. Existing transport mechanisms MUST continue to be supported across minor versions within a major version. Default service port numbers SHALL be considered Stable.
REST API compatibility applies to the exposed REST endpoints (URLs) and response data format. Hadoop REST APIs are specifically meant for stable use by clients across releases, even major ones. For purposes of this document, an exposed PEST API is one that is documented in the public documentation. The following is a non-exhaustive list of the exposed REST APIs:
Each API has an API-specific version number. Any incompatible changes MUST increment the API version number.
The exposed Hadoop REST APIs SHALL be considered Public and Evolving. With respect to API version numbers, the exposed Hadoop REST APIs SHALL be considered Public and Stable, i.e. no incompatible changes are allowed to within an API version number. A REST API version must be labeled as deprecated for a full major release before it can be removed.
The Hadoop daemons and CLIs produce log output via Log4j that is intended to aid administrators and developers in understanding and troubleshooting cluster behavior. Log messages are intended for human consumption, though automation use cases are also supported.
All log output SHALL be considered Public and Unstable. For log output, an incompatible change is one that renders a parser unable to find or recognize a line of log output.
Several components have audit logging systems that record system information in a machine readable format. Incompatible changes to that data format may break existing automation utilities. For the audit log, an incompatible change is any change that changes the format such that existing parsers no longer can parse the logs.
All audit log output SHALL be considered Public and Stable. Any change to the data format SHALL be considered an incompatible change.
While the Metrics API compatibility is governed by Java API compatibility, the Metrics data format exposed by Hadoop MUST be maintained as compatible for consumers of the data, e.g. for automation tasks.
The data format exposed via Metrics SHALL be considered Public and Stable.
User and system level data (including metadata) is stored in files of various formats. Changes to the metadata or the file formats used to store data/metadata can lead to incompatibilities between versions. Each class of file format is addressed below.
Changes to formats that end users use to store their data can prevent them from accessing the data in later releases, and hence are important to be compatible. Examples of these formats include har, war, SequenceFileFormat, etc.
User-level file formats SHALL be considered Public and Stable. User-lever file format changes SHOULD be made forward compatible across major releases and MUST be made forward compatible within a major release. The developer community SHOULD prefer the creation of a new derivative file format to making incompatible changes to an existing file format. Such new file formats MUST be created as opt-in, meaning that users must be able to continue using the existing compatible format until and unless they explicitly opt in to using the new file format.
Hadoop internal data may also be stored in files or other data stores. Changing the schemas of these data stores can lead to incompatibilities.
MapReduce uses formats like I-File to store MapReduce-specific data.
All MapReduce-internal file formats, such as I-File format or the job history server’s jhist file format, SHALL be considered Private and Stable.
HDFS persists metadata (the image and edit logs) in a private file format. Incompatible changes to either the format or the metadata prevent subsequent releases from reading older metadata. Incompatible changes must include a process by which existing metadata may be upgraded.
Depending on the degree of incompatibility in the changes, the following potential scenarios can arise:
HDFS data nodes store data in a private directory structure. Incompatible changes to the directory structure may prevent older releases from accessing stored data. Incompatible changes must include a process by which existing data directories may be upgraded.
The HDFS metadata format SHALL be considered Private and Evolving. Incompatible changes MUST include a process by which existing metadata may be upgraded. The upgrade process SHALL be allowed to require more than one upgrade. The upgrade process MUST allow the cluster metadata to be rolled back to the older version and its older disk format. The rollback MUST restore the original data but is not REQUIRED to restore the updated data. Any incompatible change to the format MUST result in the major version number of the schema being incremented.
The data node directory format SHALL be considered Private and Evolving. Incompatible changes MUST include a process by which existing data directories may be upgraded. The upgrade process SHALL be allowed to require more than one upgrade. The upgrade process MUST allow the data directories to be rolled back to the older layout.
The S3Guard metastore used to store metadata in DynamoDB tables; as such it had to maintain a compatibility strategy. Now that S3Guard is removed, the tables are not needed.
Applications configured to use an S3A metadata store other than the “null” store will fail.
The YARN resource manager stores information about the cluster state in an external state store for use in fail over and recovery. If the schema used for the state store data does not remain compatible, the resource manager will not be able to recover its state and will fail to start. The state store data schema includes a version number that indicates compatibility.
The YARN resource manager state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.
The YARN node manager stores information about the node state in an external state store for use in recovery. If the schema used for the state store data does not remain compatible, the node manager will not be able to recover its state and will fail to start. The state store data schema includes a version number that indicates compatibility.
The YARN node manager state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.
The YARN resource manager federation service stores information about the federated clusters, running applications, and routing policies in an external state store for use in replication and recovery. If the schema used for the state store data does not remain compatible, the federation service will fail to initialize. The state store data schema includes a version number that indicates compatibility.
The YARN federation service state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.
The Hadoop command line programs may be used either directly via the system shell or via shell scripts. The CLIs include both the user-facing commands, such as the hdfs command or the yarn command, and the admin-facing commands, such as the scripts used to start and stop daemons. Changing the path of a command, removing or renaming command line options, the order of arguments, or the command return codes and output break compatibility and adversely affect users.
All Hadoop CLI paths, usage, and output SHALL be considered Public and Stable unless documented as experimental and subject to change.
Note that the CLI output SHALL be considered distinct from the log output generated by the Hadoop CLIs. The latter SHALL be governed by the policy on log output. Note also that for CLI output, all changes SHALL be considered incompatible changes.
Web UI, particularly the content and layout of web pages, changes could potentially interfere with attempts to screen scrape the web pages for information. The Hadoop Web UI pages, however, are not meant to be scraped, e.g. for automation purposes. Users are expected to use REST APIs to programmatically access cluster information.
Users depend on the behavior of a Hadoop cluster remaining consistent across releases. Changes which cause unexpectedly different behaviors from the cluster can lead to frustration and long adoption cycles. No new configuration should be added which changes the behavior of an existing cluster, assuming the cluster’s configuration files remain unchanged. For any new settings that are defined, care should be taken to ensure that the new setting does not change the behavior of existing clusters.
Changes to existing functionality MUST NOT change the default behavior or the meaning of existing configuration settings between maintenance releases within the same minor version, regardless of whether the changes arise from changes to the system or logic or to internal or external default configuration values.
Changes to existing functionality SHOULD NOT change the default behavior or the meaning of existing configuration settings between minor releases within the same major version, though changes, such as to fix correctness or security issues, may require incompatible behavioral changes. Where possible such behavioral changes SHOULD be off by default.
Users use Hadoop-defined properties to configure and provide hints to Hadoop and custom properties to pass information to jobs. Users are encouraged to avoid using custom configuration property names that conflict with the namespace of Hadoop-defined properties and should avoid using any prefixes used by Hadoop, e.g. hadoop, io, ipc, fs, net, file, ftp, kfs, ha, file, dfs, mapred, mapreduce, and yarn.
In addition to properties files, Hadoop uses other configuration files to set system behavior, such as the fair scheduler configuration file or the resource profiles configuration file.
Hadoop-defined properties (names and meanings) SHALL be considered Public and Stable. The units implied by a Hadoop-defined property MUST NOT change, even across major versions. Default values of Hadoop-defined properties SHALL be considered Public and Evolving.
Hadoop configuration files that are not governed by the above rules about Hadoop-defined properties SHALL be considered Public and Stable. The definition of an incompatible change depends on the particular configuration file format, but the general rule is that a compatible change will allow a configuration file that was valid before the change to remain valid after the change.
The log output produced by Hadoop daemons and CLIs is governed by a set of configuration files. These files control the minimum level of log message that will be output by the various components of Hadoop, as well as where and how those messages are stored.
All Log4j configurations SHALL be considered Public and Evolving.
Source code, artifacts (source and tests), user logs, configuration files, output, and job history are all stored on disk on either the local file system or HDFS. Changing the directory structure of these user-accessible files can break compatibility, even in cases where the original path is preserved via symbolic links (such as when the path is accessed by a servlet that is configured to not follow symbolic links).
The layout of source code and build artifacts SHALL be considered Private and Unstable. Within a major version, the developer community SHOULD preserve the overall directory structure, though individual files MAY be added, moved, or deleted with no warning.
The directory structure of configuration files, user logs, and job history SHALL be considered Public and Evolving.
Hadoop provides several client artifacts that applications use to interact with the system. These artifacts typically have their own dependencies on common libraries. In the cases where these dependencies are exposed to end user applications or downstream consumers (i.e. not shaded) changes to these dependencies can be disruptive. Developers are strongly encouraged to avoid exposing dependencies to clients by using techniques such as shading.
With regard to dependencies, adding a dependency is an incompatible change, whereas removing a dependency is a compatible change.
Some user applications built against Hadoop may add all Hadoop JAR files (including Hadoop’s library dependencies) to the application’s classpath. Adding new dependencies or updating the versions of existing dependencies may interfere with those in applications’ classpaths and hence their correct operation. Users are therefore discouraged from adopting this practice.
The set of dependencies exposed by the Hadoop client artifacts SHALL be considered Public and Stable. Any dependencies that are not exposed to clients (either because they are shaded or only exist in non-client artifacts) SHALL be considered Private and Unstable
Users and related projects often utilize the environment variables exported by Hadoop (e.g. HADOOP_CONF_DIR). Removing or renaming environment variables can therefore impact end user applications.
The environment variables consumed by Hadoop and the environment variables made accessible to applications through YARN SHALL be considered Public and Evolving. The developer community SHOULD limit changes to major releases.
Hadoop uses Maven for project management. Changes to the contents of generated artifacts can impact existing user applications.
The contents of Hadoop test artifacts SHALL be considered Private and Unstable. Test artifacts include all JAR files generated from test source code and all JAR files that include “tests” in the file name.
The Hadoop client artifacts SHALL be considered Public and Stable. Client artifacts are the following:
All other build artifacts SHALL be considered Private and Unstable.
To keep up with the latest advances in hardware, operating systems, JVMs, and other software, new Hadoop releases may include features that require newer hardware, operating systems releases, or JVM versions than previous Hadoop releases. For a specific environment, upgrading Hadoop might require upgrading other dependent software components.
Here are some relevant JIRAs and pages related to the topic: