Package org.apache.hadoop.io
Class Text
java.lang.Object
org.apache.hadoop.io.BinaryComparable
org.apache.hadoop.io.Text
- All Implemented Interfaces:
Comparable<BinaryComparable>, Writable, WritableComparable<BinaryComparable>
@Stringable
@Public
@Stable
public class Text
extends BinaryComparable
implements WritableComparable<BinaryComparable>
This class stores text using standard UTF-8 encoding. It provides methods
to serialize, deserialize, and compare texts at the byte level. The length
is an integer and is serialized using a zero-compressed format.
In addition, it provides methods for traversing the string without converting the byte array to a string.
It also includes utilities for serializing and deserializing a string, encoding and decoding a string, checking whether a byte array contains valid UTF-8, and calculating the length of an encoded string.
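Comparing texts at the byte level can give a different order from comparing Java strings. The sketch below uses only the JDK and assumes the comparison is an unsigned, lexicographic comparison of the UTF-8 bytes; the class and method names (Utf8ByteOrder, compareBytes, compareAsUtf8) are hypothetical and are not part of this API.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteOrder {
    /** Compares two byte ranges lexicographically, treating each byte as unsigned. */
    static int compareBytes(byte[] a, int aLen, byte[] b, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen; // a proper prefix sorts before the longer range
    }

    /** Convenience wrapper: encodes both strings as UTF-8 and compares the bytes. */
    static int compareAsUtf8(String s, String t) {
        byte[] a = s.getBytes(StandardCharsets.UTF_8);
        byte[] b = t.getBytes(StandardCharsets.UTF_8);
        return compareBytes(a, a.length, b, b.length);
    }

    public static void main(String[] args) {
        String halfwidth = "\uFF61";   // U+FF61, three UTF-8 bytes
        String emoji = "\uD83D\uDE00"; // U+1F600, four UTF-8 bytes
        // Byte order follows code point order; String.compareTo follows UTF-16 code unit order.
        System.out.println("byte order:   " + Integer.signum(compareAsUtf8(halfwidth, emoji)));
        System.out.println("String order: " + Integer.signum(halfwidth.compareTo(emoji)));
    }
}
```

For text in the Basic Multilingual Plane the two orders agree; they can differ only when supplementary characters (surrogate pairs in UTF-16) are involved.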
-
Nested Class Summary
Nested Classes
Modifier and Type Class: Description
- static class org.apache.hadoop.io.Text.Comparator: A WritableComparator optimized for Text keys.
-
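Field Summary
Fields
- static final int DEFAULT_MAX_LEN
-
Constructor Summary
Constructors
- Text(): Construct an empty text string.
- Text(byte[] utf8): Construct from a byte array.
- Text(String string): Construct from a string.
- Text(Text utf8): Construct from another text.
-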
Method Summary
Modifier and Type Method: Description
- void append(byte[] utf8, int start, int len): Append a range of bytes to the end of the given text.
- static int bytesToCodePoint(ByteBuffer bytes)
- int charAt(int position): Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
- void clear(): Clear the string to empty.
- byte[] copyBytes()
- static String decode(byte[] utf8)
- static String decode(byte[] utf8, int start, int length)
- static String decode(byte[] utf8, int start, int length, boolean replace)
- static ByteBuffer encode(String string): Converts the provided String to bytes using the UTF-8 encoding.
- static ByteBuffer encode(String string, boolean replace): Converts the provided String to bytes using the UTF-8 encoding.
- boolean equals(Object o): Returns true iff o is a Text with the same length and same contents.
- int find(String what)
- int find(String what, int start): Finds any occurrence of what in the backing buffer, starting at position start.
- byte[] getBytes(): Returns the raw bytes; however, only data up to getLength() is valid.
- int getLength(): Returns the number of bytes in the byte array.
- int getTextLength()
- int hashCode(): Return a hash of the bytes returned from getBytes().
- void readFields(DataInput in): Deserialize the fields of this object from in.
- void readFields(DataInput in, int maxLength)
- static String readString(DataInput in)
- static String readString(DataInput in, int maxLength)
- void readWithKnownLength(DataInput in, int len): Read a Text object whose length is already known.
- void set(byte[] utf8): Set to a utf8 byte array.
- void set(byte[] utf8, int start, int len): Set the Text to range of bytes.
- void set(String string): Set to contain the contents of a string.
- void set(Text other): Copy a text.
- static void skip(DataInput in): Skips over one Text in the input.
- String toString()
- static int utf8Length(String string): For the given string, returns the number of UTF-8 bytes required to encode the string.
- static void validateUTF8(byte[] utf8): Check if a byte array contains valid UTF-8.
- static void validateUTF8(byte[] utf8, int start, int len): Check to see if a byte array is valid UTF-8.
- void write(DataOutput out): Serialize.
- void write(DataOutput out, int maxLength)
- static int writeString(DataOutput out, String s): Write a UTF8 encoded string to out.
- static int writeString(DataOutput out, String s, int maxLength)
Methods inherited from class org.apache.hadoop.io.BinaryComparable
compareTo, compareTo
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Comparable
compareTo
-
Field Details
-
DEFAULT_MAX_LEN
public static final int DEFAULT_MAX_LEN
- See Also:
-
-
Constructor Details
-
Text
public Text()
Construct an empty text string.
-
Text
public Text(String string)
Construct from a string.
- Parameters:
string - input string.
-
Text
public Text(Text utf8)
Construct from another text.
- Parameters:
utf8 - input utf8.
-
Text
public Text(byte[] utf8)
Construct from a byte array.
- Parameters:
utf8 - input utf8.
-
-
Method Details
-
copyBytes
public byte[] copyBytes()
- Returns:
- a copy of the bytes that is exactly the length of the data. See getBytes() for faster access to the underlying array.
-
getBytes
public byte[] getBytes()
Returns the raw bytes; however, only data up to getLength() is valid. Please use copyBytes() if you need the returned array to be precisely the length of the data.
- Specified by:
getBytes in class BinaryComparable
- Returns:
- the backing byte array, of which only the first getLength() bytes are valid.
-
getLength
public int getLength()
Returns the number of valid bytes in the byte array.
- Specified by:
getLength in class BinaryComparable
- Returns:
- the number of valid bytes.
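Because the backing array can be longer than the data, callers should always pair the array with the length. The following sketch, which uses only the JDK, shows the two safe ways to consume such a pair; ValidRangeDemo and its methods are hypothetical names, and the oversized array in the usage note stands in for what getBytes() might return.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ValidRangeDemo {
    /** Decodes only the first length bytes of a possibly oversized backing array. */
    static String decodeValid(byte[] backing, int length) {
        return new String(backing, 0, length, StandardCharsets.UTF_8);
    }

    /** Returns an array that is exactly length bytes long, in the spirit of copyBytes(). */
    static byte[] exactCopy(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }
}
```

For example, with a backing array holding the bytes of "hello world" and a length of 5, decodeValid yields "hello" and exactCopy yields a 5-byte array; decoding the whole array instead would pick up the stale trailing bytes.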
-
getTextLength
public int getTextLength()
- Returns:
- the length of this text, which is equal to the number of Unicode code units in the text.
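The byte length and the text length of the same content usually differ once non-ASCII characters are involved. The sketch below contrasts three counts for a Java string using only the JDK; it assumes that "code units" here means UTF-16 code units as counted by String.length(), and TextLengths is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;

final class TextLengths {
    /** Number of bytes in the UTF-8 encoding (the kind of value getLength() reports). */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of UTF-16 code units (assumed to match what getTextLength() reports). */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For the string "a" followed by U+1F600, the three counts are 5 bytes, 3 code units, and 2 code points.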
-
charAt
public int charAt(int position)
Returns the Unicode scalar value (a 32-bit integer) for the character at position. Note that this method avoids using the converter or instantiating a String.
- Parameters:
position - input position.
- Returns:
- the Unicode scalar value at position, or -1 if the position is invalid or points to a trailing byte
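Reading a character straight from the bytes can be sketched as follows. This is an illustration of the idea, not the library's code: Utf8CodePoints.codePointAt is a hypothetical helper that takes a byte position, and it does not reject every malformed sequence (overlong encodings, for instance, are not checked).

```java
final class Utf8CodePoints {
    /**
     * Decodes the code point whose first byte is at position, or returns -1 if the
     * position is out of range, points at a trailing byte, or the sequence is cut short.
     */
    static int codePointAt(byte[] utf8, int length, int position) {
        if (position < 0 || position >= length) {
            return -1;
        }
        int lead = utf8[position] & 0xFF;
        if (lead < 0x80) {
            return lead; // single-byte (ASCII) character
        }
        int extra;
        int cp;
        if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            return -1; // trailing byte (10xxxxxx) or invalid lead byte
        }
        if (position + extra >= length) {
            return -1; // sequence runs past the valid data
        }
        for (int i = 1; i <= extra; i++) {
            int b = utf8[position + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1; // expected a trailing byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }
}
```

With the bytes of "a" followed by U+00E9 (0x61, 0xC3, 0xA9), position 0 gives 0x61, position 1 gives 0xE9, and position 2 gives -1 because it points at a trailing byte.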
-
find
-
find
public int find(String what, int start)
Finds any occurrence of what in the backing buffer, starting at position start. The starting position is measured in bytes, and the return value is a byte position in the buffer. The backing buffer is not converted to a string for this operation.
- Parameters:
what - input what.
start - input start.
- Returns:
- byte position of the first occurrence of the search string in the UTF-8 buffer or -1 if not found
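The difference between byte positions and character indexes is easy to miss. The sketch below is a naive byte-level search written against the JDK only; Utf8Find.find is a hypothetical helper and makes no claim about the search strategy the library actually uses.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Find {
    /**
     * Returns the byte position of the first occurrence of what within the first
     * length bytes of buf, searching from byte offset start, or -1 if absent.
     */
    static int find(byte[] buf, int length, String what, int start) {
        if (start < 0) {
            return -1;
        }
        byte[] needle = what.getBytes(StandardCharsets.UTF_8);
        for (int i = start; i + needle.length <= length; i++) {
            int j = 0;
            while (j < needle.length && buf[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

For the text "h", U+00E9, "llo", searching for "llo" returns byte position 3, whereas String.indexOf on the same text returns character index 2, because U+00E9 occupies two bytes.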
-
set
public void set(String string)
Set to contain the contents of a string.
- Parameters:
string - input string.
-
set
public void set(byte[] utf8)
Set to a UTF-8 byte array. If the length of utf8 is zero, the backing bytes are actually cleared and any existing data is lost.
- Parameters:
utf8 - input utf8.
-
set
public void set(Text other)
Copy a text.
- Parameters:
other - other.
-
set
public void set(byte[] utf8, int start, int len)
Set the Text to a range of bytes.
- Parameters:
utf8 - the data to copy from
start - the first position of the new string
len - the number of bytes of the new string
-
append
public void append(byte[] utf8, int start, int len)
Append a range of bytes to the end of the given text.
- Parameters:
utf8 - the data to copy from
start - the first position to append from utf8
len - the number of bytes to append
-
clear
public void clear()
Clear the string to empty. Note: For performance reasons, this call does not clear the underlying byte array that is retrievable via getBytes(). In order to free the byte-array memory, call set(byte[]) with an empty byte array (for example, new byte[0]).
-
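The relationship between set, append, and clear can be modeled with a small growable buffer. GrowableBytes below is a hypothetical stand-in written for illustration, not the class documented here, and its doubling growth policy is an assumption.

```java
import java.util.Arrays;

final class GrowableBytes {
    private byte[] bytes = new byte[0];
    private int length;

    /** Replaces the contents; an empty array also releases the backing storage. */
    void set(byte[] src) {
        if (src.length == 0) {
            bytes = new byte[0];
            length = 0;
        } else {
            set(src, 0, src.length);
        }
    }

    /** Replaces the contents with a range of src, reusing storage when it is large enough. */
    void set(byte[] src, int start, int len) {
        length = 0;
        append(src, start, len);
    }

    /** Appends a range of src, growing the backing array if needed (assumed policy: double). */
    void append(byte[] src, int start, int len) {
        if (length + len > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(length + len, bytes.length * 2));
        }
        System.arraycopy(src, start, bytes, length, len);
        length += len;
    }

    /** Marks the buffer empty but keeps the backing array for reuse. */
    void clear() {
        length = 0;
    }

    int getLength() {
        return length;
    }

    int capacity() {
        return bytes.length;
    }

    byte[] copyBytes() {
        return Arrays.copyOf(bytes, length);
    }
}
```

In this model, clear() leaves capacity() unchanged, while set(new byte[0]) drops it to zero, which mirrors the note above about freeing the byte-array memory.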
toString
-
readFields
public void readFields(DataInput in) throws IOException
Description copied from interface: Writable
Deserialize the fields of this object from in.
For efficiency, implementations should attempt to re-use storage in the existing object where possible.
- Specified by:
readFields in interface Writable
- Parameters:
in - DataInput to deserialize this object from.
- Throws:
IOException - any other problem for readFields.
-
readFields
public void readFields(DataInput in, int maxLength) throws IOException
- Throws:
IOException
-
skip
public static void skip(DataInput in) throws IOException
Skips over one Text in the input.
- Parameters:
in - input in.
- Throws:
IOException - raised on errors performing I/O.
-
readWithKnownLength
public void readWithKnownLength(DataInput in, int len) throws IOException
Read a Text object whose length is already known. This allows creating Text from a stream which uses a different serialization format.
- Parameters:
in - input in.
len - input len.
- Throws:
IOException - raised on errors performing I/O.
-
write
public void write(DataOutput out) throws IOException
Serialize. Writes this object to out; the length is written using zero-compressed encoding.
- Specified by:
write in interface Writable
- Parameters:
out - DataOutput to serialize this object into.
- Throws:
IOException - any other problem for write.
- See Also:
-
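A zero-compressed length keeps small values to a single byte. The sketch below illustrates one such scheme for non-negative lengths: values up to 127 are written as one byte, and larger values are written as a marker byte followed by the significant bytes in big-endian order. The marker values used here follow the variable-length integer layout commonly attributed to Hadoop's WritableUtils, but treat them as an assumption; ZeroCompressedLength is a hypothetical class.

```java
import java.io.ByteArrayOutputStream;

final class ZeroCompressedLength {
    /** Encodes a non-negative length; values up to 127 take a single byte. */
    static byte[] encode(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (n <= 127) {
            out.write(n);
            return out.toByteArray();
        }
        int count = 0; // number of significant bytes in n
        for (int tmp = n; tmp != 0; tmp >>>= 8) {
            count++;
        }
        out.write(-112 - count); // marker byte announcing how many bytes follow
        for (int i = count - 1; i >= 0; i--) {
            out.write((n >>> (8 * i)) & 0xFF); // most significant byte first
        }
        return out.toByteArray();
    }

    /** Decodes a length produced by encode. */
    static int decode(byte[] data) {
        byte first = data[0];
        if (first >= -112) {
            return first; // the value itself was stored in one byte
        }
        int count = -112 - first;
        int n = 0;
        for (int i = 1; i <= count; i++) {
            n = (n << 8) | (data[i] & 0xFF);
        }
        return n;
    }
}
```

Under this scheme a length of 5 occupies one byte, while a length of 300 occupies three: the marker followed by 0x01 and 0x2C.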
write
public void write(DataOutput out, int maxLength) throws IOException
- Throws:
IOException
-
equals
public boolean equals(Object o)
Returns true iff o is a Text with the same length and same contents.
- Overrides:
equals in class BinaryComparable
-
hashCode
public int hashCode()
Description copied from class: BinaryComparable
Return a hash of the bytes returned from getBytes().
- Overrides:
hashCode in class BinaryComparable
- See Also:
-
decode
public static String decode(byte[] utf8) throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. If the input is malformed, it is replaced by a default value.
- Parameters:
utf8 - input utf8.
- Returns:
- the decoded String.
- Throws:
CharacterCodingException - when a character encoding or decoding error occurs.
-
decode
public static String decode(byte[] utf8, int start, int length) throws CharacterCodingException
- Throws:
CharacterCodingException
-
decode
public static String decode(byte[] utf8, int start, int length, boolean replace) throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
- Parameters:
utf8 - input utf8.
start - input start.
length - input length.
replace - input replace.
- Returns:
- the decoded String.
- Throws:
CharacterCodingException - when a character encoding or decoding error occurs.
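The effect of the replace flag can be reproduced with the JDK's CharsetDecoder. The sketch below is one plausible way to get the described behavior, not a copy of the library code; Utf8Decode is a hypothetical class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Decode {
    /**
     * Decodes length bytes starting at start. With replace set, malformed input
     * becomes U+FFFD; otherwise the decoder reports it by throwing.
     */
    static String decode(byte[] utf8, int start, int length, boolean replace)
            throws CharacterCodingException {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(utf8, start, length))
                .toString();
    }
}
```

Given the bytes 0x61 and 0xFF, decoding with replace set to true yields "a" followed by U+FFFD, while decoding with replace set to false throws a MalformedInputException, which is a subclass of CharacterCodingException.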
-
encode
public static ByteBuffer encode(String string) throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are replaced by a default value.
- Parameters:
string - input string.
- Returns:
- ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
- Throws:
CharacterCodingException - when a character encoding or decoding error occurs.
-
encode
public static ByteBuffer encode(String string, boolean replace) throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
- Parameters:
string - input string.
replace - input replace.
- Returns:
- ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
- Throws:
CharacterCodingException - when a character encoding or decoding error occurs.
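Since the returned buffer's array may be larger than the encoded data, callers should stop at limit(). The sketch below shows that pattern with a buffer produced by the JDK's own UTF-8 encoder, which stands in for the result of encode here; EncodedBytes is a hypothetical helper and assumes an array-backed buffer whose position is zero.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodedBytes {
    /** Stand-in for an encode call: produces an array-backed buffer of UTF-8 bytes. */
    static ByteBuffer encode(String s) {
        return StandardCharsets.UTF_8.encode(s);
    }

    /** Copies exactly the encoded bytes; array() may be longer, so stop at limit(). */
    static byte[] validBytes(ByteBuffer buf) {
        int from = buf.arrayOffset();
        return Arrays.copyOfRange(buf.array(), from, from + buf.limit());
    }
}
```

For the text "h", U+00E9, "llo", limit() is 6 and validBytes returns those six bytes regardless of how large the backing array is.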
-
readString
public static String readString(DataInput in) throws IOException
Read a UTF-8 encoded string from in.
- Parameters:
in - input in.
- Returns:
- the string read from in.
- Throws:
IOException - raised on errors performing I/O.
-
readString
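public static String readString(DataInput in, int maxLength) throws IOException
Read a UTF-8 encoded string with a maximum size.
- Parameters:
in - input datainput.
maxLength - input maxLength.
- Returns:
- the string read from in.
- Throws:
IOException - raised on errors performing I/O.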
-
writeString
public static int writeString(DataOutput out, String s) throws IOException
Write a UTF-8 encoded string to out.
- Parameters:
out - input out.
s - input s.
- Returns:
- the length in bytes of the encoded string.
- Throws:
IOException - raised on errors performing I/O.
-
writeString
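public static int writeString(DataOutput out, String s, int maxLength) throws IOException
Write a UTF-8 encoded string with a maximum size to out.
- Parameters:
out - input out.
s - input s.
maxLength - input maxLength.
- Returns:
- the length in bytes of the encoded string.
- Throws:
IOException - raised on errors performing I/O.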
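A maximum size protects both the writer and the reader from unexpectedly large strings. The sketch below shows the guard on each side using only the JDK. It deliberately uses a fixed four-byte length prefix as a stand-in for the zero-compressed length, and it assumes an IOException is the failure mode; BoundedStringIo is a hypothetical class.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class BoundedStringIo {
    /** Writes a length-prefixed UTF-8 string, refusing anything longer than maxLength bytes. */
    static int writeString(DataOutput out, String s, int maxLength) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > maxLength) {
            throw new IOException("string too long: " + bytes.length + " > " + maxLength);
        }
        out.writeInt(bytes.length); // stand-in prefix; the real format is zero-compressed
        out.write(bytes);
        return bytes.length;
    }

    /** Reads a string written by writeString, rejecting a declared length above maxLength. */
    static String readString(DataInput in, int maxLength) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > maxLength) {
            throw new IOException("declared length out of range: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Checking the declared length before allocating the array is the important step on the read side: it prevents a corrupt or hostile stream from forcing a very large allocation.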
-
validateUTF8
public static void validateUTF8(byte[] utf8) throws MalformedInputException
Check if a byte array contains valid UTF-8.
- Parameters:
utf8 - byte array
- Throws:
MalformedInputException - if the byte array contains invalid UTF-8
-
validateUTF8
public static void validateUTF8(byte[] utf8, int start, int len) throws MalformedInputException
Check to see if a byte array is valid UTF-8.
- Parameters:
utf8 - the array of bytes
start - the offset of the first byte in the array
len - the length of the byte sequence
- Throws:
MalformedInputException - if the byte array contains invalid bytes
-
bytesToCodePoint
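public static int bytesToCodePoint(ByteBuffer bytes)
Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any mark set on this buffer will be changed by this method!
- Parameters:
bytes - input bytes.
- Returns:
- the next code point at the current position in the buffer.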
-
utf8Length
public static int utf8Length(String string)
For the given string, returns the number of UTF-8 bytes required to encode the string.
- Parameters:
string - text to encode
- Returns:
- number of UTF-8 bytes required to encode
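The byte count can be computed without actually encoding the string, because each code point's UTF-8 width depends only on its value. The sketch below is a minimal illustration for well-formed strings; Utf8Length is a hypothetical class, and its treatment of unpaired surrogates (counted as three bytes) is an assumption that may not match the library.

```java
final class Utf8Length {
    /** Returns the number of bytes the UTF-8 encoding of s would occupy. */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                bytes += 1; // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3; // rest of the Basic Multilingual Plane
            } else {
                bytes += 4; // supplementary planes
            }
            i += Character.charCount(cp); // advance past one or two UTF-16 code units
        }
        return bytes;
    }
}
```

For well-formed input the result equals string.getBytes(StandardCharsets.UTF_8).length, but no intermediate byte array is allocated.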
-