These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.
The build image has been upgraded to Bionic.
ZKFC binds host address to “dfs.namenode.servicerpc-bind-host”, if configured. Otherwise, it binds to “dfs.namenode.rpc-bind-host”. If neither of those is configured, ZKFC binds itself to NameNode RPC server address (effectively “dfs.namenode.rpc-address”).
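The fallback order described above can be sketched as a small lookup. This is an illustration only, not the actual ZKFC code: the class and method names are hypothetical, a plain map stands in for the Hadoop configuration, and taking the host part of “dfs.namenode.rpc-address” as the final fallback is an assumption.

```java
import java.util.Map;

// Illustrative sketch only; not the real ZKFC implementation.
// Class and method names are hypothetical. The keys are those named in the note.
final class ZkfcBindHostSketch {
    static final String SERVICE_RPC_BIND_HOST = "dfs.namenode.servicerpc-bind-host";
    static final String RPC_BIND_HOST = "dfs.namenode.rpc-bind-host";
    static final String RPC_ADDRESS = "dfs.namenode.rpc-address";

    /** Returns the host to bind to, following the documented precedence. */
    static String resolveBindHost(Map<String, String> conf) {
        String host = conf.get(SERVICE_RPC_BIND_HOST);
        if (isSet(host)) {
            return host;                       // first choice
        }
        host = conf.get(RPC_BIND_HOST);
        if (isSet(host)) {
            return host;                       // second choice
        }
        // Last resort (assumption): the host part of the NameNode RPC address.
        String address = conf.get(RPC_ADDRESS);
        if (!isSet(address)) {
            throw new IllegalArgumentException(RPC_ADDRESS + " is not configured");
        }
        int colon = address.lastIndexOf(':');
        return colon < 0 ? address : address.substring(0, colon);
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }
}
```

With only “dfs.namenode.rpc-address” set to a placeholder such as nn1.example.com:8020, the sketch returns nn1.example.com; setting either bind-host key overrides that result.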
Azure ABFS support for Shared Access Signatures (SAS)
Distcp block size is not preserved by default, unless -pb is specified. This restores the behavior prior to Hadoop 3.
ViewFS#listStatus on root(“/”) considers listing from fallbackLink if available. If the same directory name is present in configured mount path as well as in fallback link, then only the configured mount path will be listed in the returned result.
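The precedence rule for duplicate names can be sketched as a simple merge. This is a hedged illustration, not ViewFS code: the class name, method name, and the "mount"/"fallback" labels are hypothetical, and directory names stand in for full file statuses.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names here are hypothetical, not ViewFS APIs.
final class ViewFsRootListingSketch {
    /**
     * Merges root-level entries. The result maps each directory name to its
     * origin ("mount" or "fallback"); a configured mount path always wins over
     * a fallback entry with the same name.
     */
    static Map<String, String> listRoot(Collection<String> mountDirs,
                                        Collection<String> fallbackDirs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : mountDirs) {
            result.put(name, "mount");
        }
        for (String name : fallbackDirs) {
            // Only added when no mount path with this name exists.
            result.putIfAbsent(name, "fallback");
        }
        return result;
    }
}
```

For mount directories data and user plus fallback directories user and tmp, the merged listing contains data and user from the mount table and only tmp from the fallback link.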
The balancer can now redirect getBlocks requests to a Standby NameNode, reducing the balancer's performance impact on the Active NameNode.
The feature is disabled by default. To enable it, configure the hdfs-site.xml of balancer: dfs.ha.allow.stale.reads = true.
Azure Blob File System (ABFS) SAS Generator Update
Fixes an Azure WASB bug that could cause list results to appear empty.
Remove unnecessary symlink resolution in S3AFileSystem globStatus
The S3A connector now has an option to stop deleting directory markers as files are written. This eliminates the IO throttling the operations can cause, and avoids creating tombstone markers on versioned S3 buckets.
This feature is incompatible with all versions of Hadoop which lack the HADOOP-17199 change to list and getFileStatus calls.
Consult the S3A documentation for further details.
ABFS: Support for conditional overwrite.
Improved node registration with node health status.
The SnappyCodec uses the snappy-java compression library, rather than explicitly referencing native binaries. The library contains native libraries for many operating systems and instruction sets, and falls back to a pure Java implementation. It requires snappy-java.jar to be on the classpath. The JAR can be found in hadoop-common/lib, and has already been present as part of the Avro dependencies.
The default value of the configuration dfs.image.transfer.bandwidthPerSec, which defines the maximum bandwidth available for fsimage transfer, is changed from 0 (meaning no throttle at all) to 50MB/s.
“hadoop fs” has a concat command. It is available on all filesystems which support the concat API, including HDFS and WebHDFS.
Hadoop’s LZ4 compression codec now depends on lz4-java. Native LZ4 compression is performed through the JNI bindings encapsulated in lz4-java, so it is no longer necessary to install and configure the lz4 system package.
lz4-java is declared in provided scope. Applications that wish to use the lz4 codec must declare a dependency on lz4-java explicitly.
The option “fs.creation.parallel.count” sets a semaphore to throttle the number of FileSystem instances which can be created simultaneously.
This is designed to reduce the impact of many threads in an application calling FileSystem.get() on a filesystem which takes time to instantiate, for example an object store where HTTPS connections are set up during initialization. Many threads trying to do this may create spurious delays by contending for access to synchronized blocks; simply limiting the parallelism reduces that contention, and so speeds up all threads trying to access the store.
The default value, 64, is larger than is likely to deliver any speedup, but it does mean that there should be no adverse effects from the change.
If a service appears to be blocking on all threads initializing connections to abfs, s3a or another store, try a smaller (possibly significantly smaller) value.
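The throttling idea can be sketched with a plain java.util.concurrent.Semaphore. This is a minimal illustration of the technique, not the FileSystem cache code: the class name is hypothetical, and the factory passed to create() stands in for whatever slow initialization the filesystem performs.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch only; the class name is hypothetical.
// parallelCount plays the role of "fs.creation.parallel.count".
final class ThrottledCreatorSketch<T> {
    private final Semaphore permits;

    ThrottledCreatorSketch(int parallelCount) {
        this.permits = new Semaphore(parallelCount);
    }

    /** Runs the (possibly slow) factory while holding one permit. */
    T create(Supplier<T> factory) {
        // Blocks when parallelCount creations are already in progress.
        permits.acquireUninterruptibly();
        try {
            return factory.get();
        } finally {
            permits.release();
        }
    }

    /** Number of creations that could start right now without blocking. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a count of 2, a third thread calling create() waits until one of the two in-flight creations completes, rather than piling onto the same contended initialization path.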
WARNING: No release note provided for this change.
ABFS: The default value for “fs.azure.list.max.results” was changed from 500 to 5000.
The default value of the configuration hadoop.http.idle_timeout.ms (how long Jetty waits before disconnecting an idle connection) is changed from 10000 to 60000. This property is inlined during compile time, so an application that references this property must be recompiled in order for it to take effect.
The S3A bucket existence check is disabled (fs.s3a.bucket.probe is 0), so there will be no existence check on the bucket during S3AFileSystem initialization. The first operation which attempts to interact with the bucket will fail if the bucket does not exist.
The S3A filesystem will link against the unshaded AWS S3 SDK. Making an application’s dependencies consistent with that SDK is left as an exercise. Note: native OpenSSL is not supported as a socket factory in unshaded deployments.
The S3A connector’s rename() operation now raises FileNotFoundException if the source doesn’t exist, and FileAlreadyExistsException if the destination is unsuitable. It no longer checks for a parent directory existing; instead it simply verifies that there is no file immediately above the destination path.
Added a -useiterator option in distcp which uses listStatusIterator for building the listing. Its primary purpose is to reduce the memory the client uses when building the listing.
Removed findbugs from the hadoop build images and added spotbugs instead. Upgraded SpotBugs to 4.2.2 and spotbugs-maven-plugin to 4.2.0.
The DFS client can use the newly added URI cache when creating socket addresses for read operations. It is disabled by default. When enabled, socket address creation uses a cached URI object keyed by host:port, reducing the frequency of URI object creation.
To enable it, set the following config key to true:
<property>
  <name>dfs.client.read.uri.cache.enabled</name>
  <value>true</value>
</property>
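The caching behaviour described above can be sketched as a map from host:port to a URI object. This is only an illustration of the idea, not the DFS client code: the class name is hypothetical, the dummy:// scheme is a placeholder, and the boolean flag stands in for dfs.client.read.uri.cache.enabled.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; not the actual DFS client implementation.
final class UriCacheSketch {
    private final Map<String, URI> cache = new ConcurrentHashMap<>();
    private final boolean enabled; // stands in for dfs.client.read.uri.cache.enabled

    UriCacheSketch(boolean enabled) {
        this.enabled = enabled;
    }

    /** Returns a URI for host:port, reusing a cached object when enabled. */
    URI uriFor(String host, int port) {
        String key = host + ":" + port;
        if (!enabled) {
            return build(key);                  // new object on every call
        }
        return cache.computeIfAbsent(key, UriCacheSketch::build);
    }

    private static URI build(String hostPort) {
        // "dummy" is a placeholder scheme used only to parse host and port.
        return URI.create("dummy://" + hostPort);
    }
}
```

With the cache enabled, repeated lookups for the same host:port return the same URI instance; with it disabled, each call parses a new one.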
Adds auto-reload of keystore.
Adds the following new configuration option (default: 10 seconds):
ssl.{0}.stores.reload.interval
The refresh interval used to check whether either the truststore or the keystore certificate file has changed.
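A periodic reload check of this kind can be sketched as follows. This is a hedged illustration of polling for a changed file, not Hadoop's implementation: the class and method names are hypothetical, detecting change by comparing last-modified timestamps is an assumption, and the reload action is left to the caller.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Illustrative sketch only; names are hypothetical.
final class StoreReloaderSketch {
    private final LongSupplier lastModified; // e.g. () -> storeFile.lastModified()
    private final Runnable reload;           // caller-supplied reload action
    private long loadedTimestamp;

    StoreReloaderSketch(LongSupplier lastModified, Runnable reload) {
        this.lastModified = lastModified;
        this.reload = reload;
        this.loadedTimestamp = lastModified.getAsLong();
    }

    /** One polling step: reloads only if the timestamp changed. */
    synchronized boolean checkOnce() {
        long current = lastModified.getAsLong();
        if (current == loadedTimestamp) {
            return false;
        }
        reload.run();
        loadedTimestamp = current;
        return true;
    }

    /** Polls at the given interval; the caller shuts the executor down. */
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::checkOnce, intervalMillis, intervalMillis,
                TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

In this sketch, the argument to start() plays the role of the reload interval: a changed store file is picked up at the next polling step, not immediately.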
The default quota initialization thread count during the NameNode startup process (dfs.namenode.quota.init-threads) is increased from 4 to 12.
This JIRA changes public fields in DFSHedgedReadMetrics. If you are using the public member variables of DFSHedgedReadMetrics, you need to use them through the public API.
The S3A output streams now raise UnsupportedOperationException on calls to Syncable.hsync() or Syncable.hflush(). This is to make absolutely clear to programs trying to use the syncable API that the stream doesn’t save any data at all until close. Programs which use this to flush their write ahead logs will fail immediately, rather than appear to succeed but without saving any data.
To downgrade the API calls to simply printing a warning, set “fs.s3a.downgrade.syncable.exceptions” to true. This will not change the other behaviour: no data is saved.
Object stores are not filesystems.
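The behaviour described above can be modelled with a stream that buffers everything until close(). This is a simplified sketch, not the S3A output stream: the class name is hypothetical, an in-memory byte array stands in for the object store, and the constructor flag stands in for fs.s3a.downgrade.syncable.exceptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Illustrative sketch only; not the S3A implementation.
final class BufferUntilCloseStreamSketch extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final boolean downgradeSyncableExceptions;
    private byte[] stored; // stays null until close(): nothing is saved earlier

    BufferUntilCloseStreamSketch(boolean downgradeSyncableExceptions) {
        this.downgradeSyncableExceptions = downgradeSyncableExceptions;
    }

    @Override
    public void write(int b) {
        buffer.write(b); // buffered locally only
    }

    public void hsync() {
        rejectSyncable("hsync");
    }

    public void hflush() {
        rejectSyncable("hflush");
    }

    private void rejectSyncable(String call) {
        String message = call + "() is not supported: no data is saved until close()";
        if (!downgradeSyncableExceptions) {
            throw new UnsupportedOperationException(message);
        }
        // Downgraded mode: warn only. Still nothing is persisted.
        System.err.println("WARN: " + message);
    }

    @Override
    public void close() {
        stored = buffer.toByteArray(); // the only point where data is "saved"
    }

    byte[] storedData() {
        return stored;
    }
}
```

In both modes storedData() remains null until close() is called, which is why a write-ahead log that relies on hsync() for durability cannot work on such a stream.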