Class Text

All Implemented Interfaces:
Comparable<BinaryComparable>, Writable, WritableComparable<BinaryComparable>

@Stringable @Public @Stable public class Text extends BinaryComparable implements WritableComparable<BinaryComparable>
This class stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. The length is an int and is serialized using a zero-compressed format.

In addition, it provides methods for string traversal without converting the byte array to a string.

It also includes utilities for serializing/deserializing a string, encoding/decoding a string, checking whether a byte array contains valid UTF-8, and calculating the length of an encoded string.

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    org.apache.hadoop.io.Text.Comparator
    A WritableComparator optimized for Text keys.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    Text()
    Construct an empty text string.
    Text(byte[] utf8)
    Construct from a byte array.
    Text(String string)
    Construct from a string.
    Text(Text utf8)
    Construct from another text.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    append(byte[] utf8, int start, int len)
    Append a range of bytes to the end of the given text.
    static int
    bytesToCodePoint(ByteBuffer bytes)
     
    int
    charAt(int position)
    Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
    void
    clear()
    Clear the string to empty.
    byte[]
    copyBytes()
     
    static String
    decode(byte[] utf8)
     
    static String
    decode(byte[] utf8, int start, int length)
     
    static String
    decode(byte[] utf8, int start, int length, boolean replace)
     
    static ByteBuffer
    encode(String string)
    Converts the provided String to bytes using the UTF-8 encoding.
    static ByteBuffer
    encode(String string, boolean replace)
    Converts the provided String to bytes using the UTF-8 encoding.
    boolean
    equals(Object o)
    Returns true iff o is a Text with the same length and same contents.
    int
    find(String what)
     
    int
    find(String what, int start)
    Finds any occurrence of what in the backing buffer, starting at position start.
    byte[]
    getBytes()
    Returns the raw bytes; however, only data up to getLength() is valid.
    int
    getLength()
    Returns the number of bytes in the byte array.
    int
    getTextLength()
     
    int
    hashCode()
    Return a hash of the bytes returned from getBytes().
    void
    readFields(DataInput in)
    Deserialize the fields of this object from in.
    void
    readFields(DataInput in, int maxLength)
     
    static String
    readString(DataInput in)
     
    static String
    readString(DataInput in, int maxLength)
     
    void
    readWithKnownLength(DataInput in, int len)
    Read a Text object whose length is already known.
    void
    set(byte[] utf8)
    Set to a utf8 byte array.
    void
    set(byte[] utf8, int start, int len)
    Set the Text to range of bytes.
    void
    set(String string)
    Set to contain the contents of a string.
    void
    set(Text other)
    Copy a text.
    static void
    skip(DataInput in)
    Skips over one Text in the input.
    String
    toString()
     
    static int
    utf8Length(String string)
    For the given string, returns the number of UTF-8 bytes required to encode the string.
    static void
    validateUTF8(byte[] utf8)
    Check if a byte array contains valid UTF-8.
    static void
    validateUTF8(byte[] utf8, int start, int len)
    Check to see if a byte array is valid UTF-8.
    void
    write(DataOutput out)
    Serialize.
    void
    write(DataOutput out, int maxLength)
     
    static int
    writeString(DataOutput out, String s)
    Write a UTF-8 encoded string to out.
    static int
    writeString(DataOutput out, String s, int maxLength)
     

    Methods inherited from class org.apache.hadoop.io.BinaryComparable

    compareTo, compareTo

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait

    Methods inherited from interface java.lang.Comparable

    compareTo
  • Field Details

  • Constructor Details

    • Text

      public Text()
      Construct an empty text string.
    • Text

      public Text(String string)
      Construct from a string.
      Parameters:
      string - input string.
    • Text

      public Text(Text utf8)
      Construct from another text.
      Parameters:
      utf8 - the Text to copy.
    • Text

      public Text(byte[] utf8)
      Construct from a byte array.
      Parameters:
      utf8 - input utf8.
  • Method Details

    • copyBytes

      public byte[] copyBytes()
      Returns:
      a copy of the bytes that is exactly the length of the data. See getBytes() for faster access to the underlying array.
    • getBytes

      public byte[] getBytes()
      Returns the raw bytes; however, only data up to getLength() is valid. Please use copyBytes() if you need the returned array to be precisely the length of the data.
      Specified by:
      getBytes in class BinaryComparable
      Returns:
      the raw backing byte array, of which only the first getLength() bytes are valid.
    • getLength

      public int getLength()
      Returns the number of bytes in the byte array.
      Specified by:
      getLength in class BinaryComparable
      Returns:
      the number of valid bytes in the byte array.
    • getTextLength

      public int getTextLength()
      Returns:
      the length of this text, which is equal to the number of Unicode code units in the text.
    • charAt

      public int charAt(int position)
      Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this method avoids using the converter or doing String instantiation.
      Parameters:
      position - the byte position in the buffer.
      Returns:
      the Unicode scalar value at position or -1 if the position is invalid or points to a trailing byte
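
      To illustrate what reading a scalar value at a byte position involves, here is a minimal JDK-only sketch. It is not Hadoop's implementation: the class CodePointAtSketch and its method codePointAt are hypothetical names, and the validation is deliberately simplified (it does not reject every overlong or surrogate encoding).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class CodePointAtSketch {
    /**
     * Decodes the code point whose first byte is at position.
     * Returns -1 if the position is out of range, points at a trailing
     * (continuation) byte, or the sequence is truncated or malformed.
     */
    static int codePointAt(byte[] utf8, int length, int position) {
        if (position < 0 || position >= length) {
            return -1;
        }
        int lead = utf8[position] & 0xFF;
        if (lead < 0x80) {
            return lead; // single-byte (ASCII) sequence
        }
        int extra; // number of continuation bytes expected
        int cp;    // payload bits collected so far
        if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
            cp = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
            cp = lead & 0x0F;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            return -1; // continuation byte (0x80..0xBF) or invalid lead byte
        }
        if (position + extra >= length) {
            return -1; // sequence runs past the valid data
        }
        for (int i = 1; i <= extra; i++) {
            int b = utf8[position + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1; // expected a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) {
        byte[] data = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(codePointAt(data, data.length, 1)); // 233 (U+00E9)
        System.out.println(codePointAt(data, data.length, 2)); // -1 (trailing byte)
    }
}
```

      The position argument is a byte offset, so position 2 in the example lands in the middle of the two-byte sequence for U+00E9 and yields -1.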
    • find

      public int find(String what)
    • find

      public int find(String what, int start)
      Finds any occurrence of what in the backing buffer, starting at position start. The starting position is measured in bytes and the return value is in terms of byte position in the buffer. The backing buffer is not converted to a string for this operation.
      Parameters:
      what - the string to search for.
      start - the byte position at which to start the search.
      Returns:
      byte position of the first occurrence of the search string in the UTF-8 buffer or -1 if not found
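
      Because both the start argument and the result are byte positions, they can differ from String indices whenever the text contains multi-byte characters. The following JDK-only sketch shows the idea with a naive byte search; FindSketch and its find method are hypothetical names and do not reproduce Hadoop's implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class FindSketch {
    /**
     * Returns the byte position of the first occurrence of what in the
     * first length bytes of buf, searching from byte offset start,
     * or -1 if it is not found.
     */
    static int find(byte[] buf, int length, String what, int start) {
        if (start < 0) {
            return -1;
        }
        byte[] needle = what.getBytes(StandardCharsets.UTF_8);
        for (int i = start; i + needle.length <= length; i++) {
            int j = 0;
            while (j < needle.length && buf[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i; // every byte of the needle matched at offset i
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo w\u00F6rld";
        byte[] buf = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(find(buf, buf.length, "w\u00F6rld", 0)); // 7 (byte position)
        System.out.println(s.indexOf("w\u00F6rld"));                // 6 (char index)
    }
}
```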
    • set

      public void set(String string)
      Set to contain the contents of a string.
      Parameters:
      string - input string.
    • set

      public void set(byte[] utf8)
      Set to a UTF-8 byte array. If utf8 has length zero, the backing byte array is actually cleared and any existing data is lost.
      Parameters:
      utf8 - input utf8.
    • set

      public void set(Text other)
      Copy a text.
      Parameters:
      other - the Text to copy.
    • set

      public void set(byte[] utf8, int start, int len)
      Set the Text to range of bytes.
      Parameters:
      utf8 - the data to copy from
      start - the first position of the new string
      len - the number of bytes of the new string
    • append

      public void append(byte[] utf8, int start, int len)
      Append a range of bytes to the end of the given text.
      Parameters:
      utf8 - the data to copy from
      start - the first position to append from utf8
      len - the number of bytes to append
    • clear

      public void clear()
      Clear the string to empty. Note: For performance reasons, this call does not clear the underlying byte array that is retrievable via getBytes(). To free the byte-array memory, call set(byte[]) with an empty byte array (for example, new byte[0]).
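
      The distinction between the valid length and the capacity of the backing array, which getBytes(), copyBytes(), and clear() all rely on, can be sketched with a small reusable buffer. This is a JDK-only illustration under assumed semantics: ReusableBufferSketch is a hypothetical class, not Hadoop's Text, and it omits the special handling of zero-length input described for set(byte[]).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class ReusableBufferSketch {
    private byte[] bytes = new byte[0]; // backing array, may be longer than the data
    private int length = 0;             // number of valid bytes

    void set(byte[] src, int start, int len) {
        if (bytes.length < len) {
            bytes = new byte[len]; // grow only when the data does not fit
        }
        System.arraycopy(src, start, bytes, 0, len);
        length = len;
    }

    void clear() {
        length = 0; // the backing array is kept for reuse
    }

    byte[] getBytes() {
        return bytes; // raw array: only the first getLength() bytes are valid
    }

    int getLength() {
        return length;
    }

    byte[] copyBytes() {
        return Arrays.copyOf(bytes, length); // exactly the valid bytes
    }

    public static void main(String[] args) {
        ReusableBufferSketch b = new ReusableBufferSketch();
        byte[] first = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] second = "hi".getBytes(StandardCharsets.UTF_8);
        b.set(first, 0, first.length);
        b.set(second, 0, second.length);
        System.out.println(b.getLength());        // 2
        System.out.println(b.getBytes().length);  // 11 (stale bytes remain)
        System.out.println(b.copyBytes().length); // 2
    }
}
```

      Code that reads the raw array should therefore always pair it with the valid length rather than with the array's own length.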
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • readFields

      public void readFields(DataInput in) throws IOException
      Description copied from interface: Writable
      Deserialize the fields of this object from in.

      For efficiency, implementations should attempt to re-use storage in the existing object where possible.

      Specified by:
      readFields in interface Writable
      Parameters:
      in - DataInput to deserialize this object from.
      Throws:
      IOException - any other problem for readFields.
    • readFields

      public void readFields(DataInput in, int maxLength) throws IOException
      Throws:
      IOException
    • skip

      public static void skip(DataInput in) throws IOException
      Skips over one Text in the input.
      Parameters:
      in - the DataInput to read from.
      Throws:
      IOException - raised on errors performing I/O.
    • readWithKnownLength

      public void readWithKnownLength(DataInput in, int len) throws IOException
      Read a Text object whose length is already known. This allows creating Text from a stream which uses a different serialization format.
      Parameters:
      in - the DataInput to read from.
      len - the already known length of the Text.
      Throws:
      IOException - raised on errors performing I/O.
    • write

      public void write(DataOutput out) throws IOException
      Serialize. Writes this object to out; the length is written using zero-compressed encoding.
      Specified by:
      write in interface Writable
      Parameters:
      out - DataOutput to serialize this object into.
      Throws:
      IOException - any other problem for write.
    • write

      public void write(DataOutput out, int maxLength) throws IOException
      Throws:
      IOException
    • equals

      public boolean equals(Object o)
      Returns true iff o is a Text with the same length and same contents.
      Overrides:
      equals in class BinaryComparable
    • hashCode

      public int hashCode()
      Description copied from class: BinaryComparable
      Return a hash of the bytes returned from getBytes().
      Overrides:
      hashCode in class BinaryComparable
    • decode

      public static String decode(byte[] utf8) throws CharacterCodingException
      Parameters:
      utf8 - the UTF-8 encoded bytes to decode.
      Returns:
      the String decoded from the provided byte array using the UTF-8 encoding. If the input is malformed, it is replaced by a default value.
      Throws:
      CharacterCodingException - when a character encoding or decoding error occurs.
    • decode

      public static String decode(byte[] utf8, int start, int length) throws CharacterCodingException
      Throws:
      CharacterCodingException
    • decode

      public static String decode(byte[] utf8, int start, int length, boolean replace) throws CharacterCodingException
      Parameters:
      utf8 - the UTF-8 encoded bytes to decode.
      start - the offset of the first byte to decode.
      length - the number of bytes to decode.
      replace - whether malformed input is replaced with the substitution character.
      Returns:
      the String decoded from the provided byte array using the UTF-8 encoding. If replace is true, malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
      Throws:
      CharacterCodingException - when a character encoding or decoding error occurs.
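
      The effect of the replace flag can be reproduced with the JDK's own CharsetDecoder, as in the sketch below. This is an assumption-labeled illustration rather than Hadoop's code: DecodeSketch is a hypothetical class, and it only mirrors the documented behavior (substitute U+FFFD when replace is true, otherwise report the error).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class DecodeSketch {
    static String decode(byte[] utf8, int start, int length, boolean replace)
            throws CharacterCodingException {
        // REPLACE substitutes the decoder's replacement string (U+FFFD);
        // REPORT makes decode() throw a CharacterCodingException subclass.
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(utf8, start, length)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never occurs in valid UTF-8
        String lenient = decode(bad, 0, bad.length, true);
        System.out.println(lenient.equals("a\uFFFDb")); // true
        try {
            decode(bad, 0, bad.length, false);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```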
    • encode

      public static ByteBuffer encode(String string) throws CharacterCodingException
      Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are replaced by a default value.
      Parameters:
      string - input string.
      Returns:
      a ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
      Throws:
      CharacterCodingException - when a character encoding or decoding error occurs.
    • encode

      public static ByteBuffer encode(String string, boolean replace) throws CharacterCodingException
      Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
      Parameters:
      string - the String to encode.
      replace - whether malformed input is replaced with the substitution character.
      Returns:
      a ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
      Throws:
      CharacterCodingException - when a character encoding or decoding error occurs.
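
      The note about ByteBuffer.array() and ByteBuffer.limit() matters because the backing array of an encoder's result can be longer than the encoded data. The sketch below uses the JDK's CharsetEncoder to show how a caller would read such a buffer; EncodeSketch is a hypothetical class and is only assumed to resemble what the method does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class EncodeSketch {
    static ByteBuffer encode(String string, boolean replace)
            throws CharacterCodingException {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return encoder.encode(CharBuffer.wrap(string));
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer buf = encode("h\u00E9llo", true);
        // Use limit() as the length; array().length may be larger.
        String roundTrip =
                new String(buf.array(), 0, buf.limit(), StandardCharsets.UTF_8);
        System.out.println(buf.limit());                  // 6
        System.out.println(roundTrip.equals("h\u00E9llo")); // true
    }
}
```

      Copying buf.array() wholesale, without honoring limit(), risks picking up unused trailing bytes.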
    • readString

      public static String readString(DataInput in) throws IOException
      Parameters:
      in - the DataInput to read from.
      Returns:
      the UTF-8 encoded string read from in.
      Throws:
      IOException - raised on errors performing I/O.
    • readString

      public static String readString(DataInput in, int maxLength) throws IOException
      Parameters:
      in - the DataInput to read from.
      maxLength - the maximum size of the string.
      Returns:
      the UTF-8 encoded string read from in, subject to the maximum size.
      Throws:
      IOException - raised on errors performing I/O.
    • writeString

      public static int writeString(DataOutput out, String s) throws IOException
      Write a UTF-8 encoded string to out.
      Parameters:
      out - the DataOutput to write to.
      s - the string to write.
      Returns:
      the length in bytes of the UTF-8 encoded string written to out.
      Throws:
      IOException - raised on errors performing I/O.
    • writeString

      public static int writeString(DataOutput out, String s, int maxLength) throws IOException
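      Write a UTF-8 encoded string with a maximum size to out.
      Parameters:
      out - the DataOutput to write to.
      s - the string to write.
      maxLength - the maximum size of the encoded string.
      Returns:
      the length in bytes of the UTF-8 encoded string written to out.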
      Throws:
      IOException - raised on errors performing I/O.
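
      The readString, writeString, and skip helpers all depend on a length-prefixed layout: a length followed by that many UTF-8 bytes. The sketch below shows the general pattern using only the JDK. Several things here are assumptions and not Hadoop's behavior: LengthPrefixedSketch is a hypothetical class, the variable-length integer is a simplified 7-bits-per-byte encoding that stands in for (and is not compatible with) Hadoop's zero-compressed format, and the maxLength checks are one plausible interpretation.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class LengthPrefixedSketch {
    // Simplified variable-length int: 7 payload bits per byte, high bit
    // set on every byte except the last. NOT Hadoop's zero-compressed format.
    static void writeVarInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVarInt(DataInput in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("variable-length int is too long");
    }

    // Writes the byte length, then the bytes; returns the byte length.
    static int writeString(DataOutput out, String s, int maxLength) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > maxLength) {
            throw new IOException("string too long: " + bytes.length + " > " + maxLength);
        }
        writeVarInt(out, bytes.length);
        out.write(bytes);
        return bytes.length;
    }

    // Reads the byte length, checks it against maxLength, then reads the bytes.
    static String readString(DataInput in, int maxLength) throws IOException {
        int len = readVarInt(in);
        if (len < 0 || len > maxLength) {
            throw new IOException("length out of range: " + len);
        }
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Skips one record without decoding it: read the length, then skip the bytes.
    static void skip(DataInput in) throws IOException {
        int remaining = readVarInt(in);
        if (remaining < 0) {
            throw new IOException("negative length: " + remaining);
        }
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("stream ended inside a record");
            }
            remaining -= skipped;
        }
    }
}
```

      Because the length comes first, a reader can skip a record it does not need, or reject an oversized one, before allocating any buffer for its contents.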
    • validateUTF8

      public static void validateUTF8(byte[] utf8) throws MalformedInputException
      Check if a byte array contains valid UTF-8.
      Parameters:
      utf8 - byte array
      Throws:
      MalformedInputException - if the byte array contains invalid UTF-8
    • validateUTF8

      public static void validateUTF8(byte[] utf8, int start, int len) throws MalformedInputException
      Check to see if a byte array is valid UTF-8.
      Parameters:
      utf8 - the array of bytes
      start - the offset of the first byte in the array
      len - the length of the byte sequence
      Throws:
      MalformedInputException - if the byte array contains invalid bytes
    • bytesToCodePoint

      public static int bytesToCodePoint(ByteBuffer bytes)
      Parameters:
      bytes - the buffer to read from.
      Returns:
      the next code point at the current position in the buffer. The buffer's position will be incremented. Any mark set on this buffer will be changed by this method!
    • utf8Length

      public static int utf8Length(String string)
      For the given string, returns the number of UTF-8 bytes required to encode the string.
      Parameters:
      string - text to encode
      Returns:
      number of UTF-8 bytes required to encode
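
      Computing the encoded size without actually encoding comes down to classifying each code point by the number of UTF-8 bytes it needs. The following JDK-only sketch shows one way to do that; Utf8LengthSketch is a hypothetical class, and it assumes a well-formed string (no unpaired surrogates), so it is not a statement of how Hadoop handles malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of org.apache.hadoop.io.Text.
final class Utf8LengthSketch {
    /** Returns the number of UTF-8 bytes needed for a well-formed string. */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 1 or 2 UTF-16 code units
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo \uD83D\uDE00"; // contains U+00E9 and U+1F600
        System.out.println(utf8Length(s));                             // 11
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 11
        System.out.println(s.length());                                // 8 code units
    }
}
```

      The same string therefore has three different "lengths": 11 UTF-8 bytes, 8 UTF-16 code units, and 7 code points.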