Class SequenceFile

java.lang.Object
org.apache.hadoop.io.SequenceFile

@Public @Stable public class SequenceFile extends Object
SequenceFiles are flat files consisting of binary key/value pairs.

SequenceFile provides SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:
  1. Writer : Uncompressed records.
  2. RecordCompressWriter : Record-compressed files; only the values are compressed.
  3. BlockCompressWriter : Block-compressed files, both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec.

The recommended way is to use the static createWriter methods provided by SequenceFile to choose the preferred format.

The SequenceFile.Reader acts as the bridge and can read any of the above SequenceFile formats.

SequenceFile Formats

Essentially there are 3 different formats for SequenceFiles depending on the CompressionType specified. All of them share a common header described below.

  • version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName - key class
  • valueClassName - value class
  • compression - A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec - CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata - SequenceFile.Metadata for this file.
  • sync - A sync marker to denote end of the header.
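As a rough illustration of the first header field, the sketch below checks the three magic bytes and reads the version byte from the start of a byte array. It assumes the version is stored as a raw byte value (6 rather than the character '6'). The class and method names are hypothetical and not part of Hadoop, and the remaining header fields are not parsed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class SeqHeaderSketch {
    private static final byte[] MAGIC = "SEQ".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the version number stored in the fourth byte,
     * or -1 if the data does not start with the SEQ magic bytes.
     */
    static int readVersion(byte[] data) {
        if (data == null || data.length < 4) {
            return -1;
        }
        if (!Arrays.equals(Arrays.copyOfRange(data, 0, 3), MAGIC)) {
            return -1;
        }
        // Assumption: the version is a raw numeric byte, read here as unsigned.
        return data[3] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] header = {'S', 'E', 'Q', 6};
        System.out.println("version = " + readVersion(header));
    }
}
```

A real reader would go on to parse the class names, compression flags, codec, metadata and sync marker; SequenceFile.Reader does all of this for you.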

Uncompressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 kilobytes or so.
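The record layout listed above can be sketched with plain JDK streams. This is only an illustration under stated assumptions: lengths are written as 4-byte big-endian integers, the record length is taken to be the key bytes plus the value bytes, and the key and value are already serialized. The class name is hypothetical; header and sync markers are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class UncompressedRecordSketch {
    /** Serializes one record as: record length, key length, key bytes, value bytes. */
    static byte[] encodeRecord(byte[] key, byte[] value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(key.length + value.length); // record length (assumed: key + value bytes)
            out.writeInt(key.length);                // key length
            out.write(key);                          // serialized key
            out.write(value);                        // serialized value
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] record = encodeRecord(
            "k1".getBytes(StandardCharsets.UTF_8),
            "hello".getBytes(StandardCharsets.UTF_8));
        // 4 (record length) + 4 (key length) + 2 (key) + 5 (value)
        System.out.println(record.length + " bytes"); // prints "15 bytes"
    }
}
```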

Record-Compressed SequenceFile Format

  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value
  • A sync-marker every few 100 kilobytes or so.
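To make the difference from the uncompressed layout concrete, the sketch below compresses only the value and stores the key as-is. It uses java.util.zip DEFLATE as a stand-in for a CompressionCodec, and it assumes 4-byte big-endian lengths with the record length covering the key plus the compressed value. All names are hypothetical; header and sync markers are omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class RecordCompressedSketch {
    /** Writes: record length, key length, key bytes, DEFLATE-compressed value. */
    static byte[] encodeRecord(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (DeflaterOutputStream deflate = new DeflaterOutputStream(compressed)) {
                deflate.write(value);
            }
            byte[] compressedValue = compressed.toByteArray();

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(key.length + compressedValue.length); // record length
                out.writeInt(key.length);                          // key length
                out.write(key);                                    // key, uncompressed
                out.write(compressedValue);                        // compressed value
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one record produced by encodeRecord and returns the decompressed value. */
    static byte[] decodeValue(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            int recordLength = in.readInt();
            int keyLength = in.readInt();
            in.skipBytes(keyLength);
            byte[] compressedValue = new byte[recordLength - keyLength];
            in.readFully(compressedValue);

            ByteArrayOutputStream value = new ByteArrayOutputStream();
            try (InflaterInputStream inflate =
                     new InflaterInputStream(new ByteArrayInputStream(compressedValue))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = inflate.read(buf)) != -1) {
                    value.write(buf, 0, n);
                }
            }
            return value.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encodeRecord(
            "k1".getBytes(StandardCharsets.UTF_8),
            "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".getBytes(StandardCharsets.UTF_8));
        // Round-trips the value through the stand-in codec.
        System.out.println(new String(decodeValue(record), StandardCharsets.UTF_8));
    }
}
```

Because the key stays uncompressed, a reader can inspect or compare keys without invoking the codec.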

Block-Compressed SequenceFile Format

  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker every block.

The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
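The following sketch shows one way a zero-compressed integer can be encoded. It assumes the format matches the variable-length scheme used by WritableUtils.writeVInt/writeVLong: values up to 127 take a single byte, and larger values are written as a marker byte followed by the significant bytes in big-endian order. Only non-negative values are handled, since lengths are never negative. The class is hypothetical and meant as an illustration, not a replacement for the Hadoop utilities.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class VIntSketch {
    /** Encodes a non-negative value in the assumed variable-length format. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value <= 127) {
            out.write((int) value); // small values fit in a single byte
            return out.toByteArray();
        }
        int marker = -112;
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--; // one step per significant byte
        }
        out.write(marker); // only the low 8 bits are written
        int payloadBytes = -(marker + 112);
        for (int idx = payloadBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((value >> shift) & 0xFF)); // most significant byte first
        }
        return out.toByteArray();
    }

    /** Decodes a value produced by encode. */
    static long decode(byte[] bytes) {
        byte first = bytes[0];
        if (first >= 0) {
            return first; // single-byte value
        }
        if (first < -120 || first > -113) {
            throw new IllegalArgumentException("negative values are not handled by this sketch");
        }
        int payloadBytes = -(first + 112);
        long result = 0;
        for (int i = 1; i <= payloadBytes; i++) {
            result = (result << 8) | (bytes[i] & 0xFF);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(300))); // prints [-114, 1, 44]
        System.out.println(decode(encode(300)));          // prints 300
    }
}
```

Short keys and values, which are the common case, therefore cost one byte each in the lengths blocks before compression.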

  • Field Details

    • SYNC_INTERVAL

      public static final int SYNC_INTERVAL
      The default number of bytes between sync points: 100 KB, computed as 5 KB * 20.
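The arithmetic behind this constant can be written out as follows. The sketch assumes that 1 KB means 1024 bytes and that the factor of 20 is the size of a sync marker in bytes; both are assumptions, and the names below are local stand-ins rather than the Hadoop fields. The marker estimate is only a rough figure for record-oriented files.

```java
/** Hypothetical helper for illustration; not part of the Hadoop API. */
final class SyncIntervalSketch {
    // Assumption: the factor of 20 is the sync marker size in bytes.
    static final int ASSUMED_SYNC_SIZE = 20;

    // 5 KB * 20 = 100 KB, taking 1 KB as 1024 bytes.
    static final int ASSUMED_SYNC_INTERVAL = 5 * 1024 * ASSUMED_SYNC_SIZE;

    /** Rough estimate of how many sync markers a file of the given length contains. */
    static long approxSyncMarkers(long fileLengthBytes) {
        return fileLengthBytes / ASSUMED_SYNC_INTERVAL;
    }

    public static void main(String[] args) {
        System.out.println(ASSUMED_SYNC_INTERVAL);              // prints 102400
        System.out.println(approxSyncMarkers(1024L * 1024L));   // prints 10
    }
}
```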
  • Method Details

    • getDefaultCompressionType

      public static SequenceFile.CompressionType getDefaultCompressionType(Configuration job)
      Get the default compression type for sequence files, as read from the given configuration.
      Parameters:
      job - the job config to look in
      Returns:
      the kind of compression to use
    • setDefaultCompressionType

      public static void setDefaultCompressionType(Configuration job, SequenceFile.CompressionType val)
      Set the default compression type for sequence files.
      Parameters:
      job - the configuration to modify
      val - the new compression type (none, block, record)
    • createWriter

      public static org.apache.hadoop.io.SequenceFile.Writer createWriter(Configuration conf, org.apache.hadoop.io.SequenceFile.Writer.Option... opts) throws IOException
      Create a new Writer with the given options.
      Parameters:
      conf - the configuration to use
      opts - the options to create the file with
      Returns:
      a new Writer
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, Progressable progress) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      progress - The Progressable object to track progress.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec, Progressable progress, org.apache.hadoop.io.SequenceFile.Metadata metadata) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      progress - The Progressable object to track progress.
      metadata - The metadata of the file.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, int bufferSize, short replication, long blockSize, SequenceFile.CompressionType compressionType, CompressionCodec codec, Progressable progress, org.apache.hadoop.io.SequenceFile.Metadata metadata) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      bufferSize - buffer size for the underlying output stream.
      replication - replication factor for the file.
      blockSize - block size for the file.
      compressionType - The compression type.
      codec - The compression codec.
      progress - The Progressable object to track progress.
      metadata - The metadata of the file.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, int bufferSize, short replication, long blockSize, boolean createParent, SequenceFile.CompressionType compressionType, CompressionCodec codec, org.apache.hadoop.io.SequenceFile.Metadata metadata) throws IOException
      Deprecated.
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      bufferSize - buffer size for the underlying output stream.
      replication - replication factor for the file.
      blockSize - block size for the file.
      createParent - create parent directory if non-existent
      compressionType - The compression type.
      codec - The compression codec.
      metadata - The metadata of the file.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileContext fc, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec, org.apache.hadoop.io.SequenceFile.Metadata metadata, EnumSet<CreateFlag> createFlag, org.apache.hadoop.fs.Options.CreateOpts... opts) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fc - The context for the specified file.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      metadata - The metadata of the file.
      createFlag - gives the semantics of create: overwrite, append etc.
      opts - file creation options; see Options.CreateOpts.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec, Progressable progress) throws IOException
      Construct the preferred type of SequenceFile Writer.
      Parameters:
      fs - The configured filesystem.
      conf - The configuration.
      name - The name of the file.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      progress - The Progressable object to track progress.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec, org.apache.hadoop.io.SequenceFile.Metadata metadata) throws IOException
      Construct the preferred type of 'raw' SequenceFile Writer.
      Parameters:
      conf - The configuration.
      out - The stream on top of which the writer is to be constructed.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      metadata - The metadata of the file.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.
    • createWriter

      @Deprecated public static org.apache.hadoop.io.SequenceFile.Writer createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, SequenceFile.CompressionType compressionType, CompressionCodec codec) throws IOException
      Construct the preferred type of 'raw' SequenceFile Writer.
      Parameters:
      conf - The configuration.
      out - The stream on top of which the writer is to be constructed.
      keyClass - The 'key' type.
      valClass - The 'value' type.
      compressionType - The compression type.
      codec - The compression codec.
      Returns:
      Returns the handle to the constructed SequenceFile Writer.
      Throws:
      IOException - raised on errors performing I/O.