ByteTokenizer
- Original link: https://keras.io/api/keras_hub/tokenizers/byte_tokenizer/
- Last checked: 2024-11-26
ByteTokenizer class
keras_hub.tokenizers.ByteTokenizer(
    lowercase=True,
    sequence_length=None,
    normalization_form=None,
    errors="replace",
    replacement_char=65533,
    dtype="int32",
    **kwargs
)

Raw byte tokenizer.
This tokenizer is a vocabulary-free tokenizer which will tokenize text as raw bytes from [0, 256).
Tokenizer outputs can either be padded and truncated with a
sequence_length argument, or left un-truncated. The exact output will
depend on the rank of the input tensors.
If input is a batch of strings:
By default, the layer will output a tf.RaggedTensor where the last
dimension of the output is ragged. If sequence_length is set, the layer
will output a dense tf.Tensor where all inputs have been padded or
truncated to sequence_length.
If input is a scalar string:
There are two cases here. If sequence_length is set, the output will be
a dense tf.Tensor of shape [sequence_length]. Otherwise, the output will
be a dense tf.Tensor of shape [None].
The output dtype can be controlled via the dtype argument, which should be an integer type ("int16", "int32", etc.).
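For example, requesting a 16-bit integer dtype (an illustrative sketch, not from the original page):
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(dtype="int16")
>>> np.array(tokenizer("hi"))
array([104, 105], dtype=int16)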
Arguments
- lowercase: boolean. If True, the input text will be converted to lowercase before tokenization.
- sequence_length: int. If set, the output will be converted to a dense tensor and padded/trimmed so that all outputs are of length sequence_length (see the sketch after this list).
- normalization_form: string. One of the following values: (None, "NFC", "NFKC", "NFD", "NFKD"). If set, every UTF-8 string in the input tensor text will be normalized to the given form before tokenizing.
- errors: string. One of ("replace", "ignore", "strict"). Specifies the detokenize() behavior when an invalid byte sequence is encountered (same behavior as https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode). A value of "strict" will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of "replace" will cause the tokenizer to replace any invalid formatting in the input with the replacement_char codepoint. A value of "ignore" will cause the tokenizer to skip any invalid formatting in the input and produce no corresponding output character.
- replacement_char: int. The replacement character to use when an invalid byte sequence is encountered and errors is set to "replace". Defaults to 65533, the codepoint of the Unicode replacement character (U+FFFD).
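As referenced in the sequence_length description above, a scalar input is padded or truncated to exactly that length (an illustrative sketch, not from the original page):
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(sequence_length=3)
>>> np.array(tokenizer("hello"))
array([104, 101, 108], dtype=int32)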
Examples
Basic usage.
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> outputs = tokenizer("hello")
>>> np.array(outputs)
array([104, 101, 108, 108, 111], dtype=int32)
Ragged outputs.
>>> inputs = ["hello", "hi"]
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([104, 101, 108, 108, 111], dtype=int32)
>>> np.array(seq2)
array([104, 105], dtype=int32)
Dense outputs.
>>> inputs = ["hello", "hi"]
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(sequence_length=8)
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([104, 101, 108, 108, 111, 0, 0, 0], dtype=int32)
>>> np.array(seq2)
array([104, 105, 0, 0, 0, 0, 0, 0], dtype=int32)
Tokenize, then batch for ragged outputs.
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>
Batch, then tokenize for ragged outputs.
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>
Tokenize, then batch for dense outputs (sequence_length provided).
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
[102, 117, 110, 0, 0]], dtype=int32)>
Batch, then tokenize for dense outputs (sequence_length provided).
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
[102, 117, 110, 0, 0]], dtype=int32)>
Detokenization.
>>> inputs = [104, 101, 108, 108, 111]
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> tokenizer.detokenize(inputs)
'hello'
Detokenization with invalid bytes.
>>> # The 255 below is invalid utf-8.
>>> inputs = [104, 101, 255, 108, 108, 111]
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer(
... errors="replace", replacement_char=88)
>>> tokenizer.detokenize(inputs)
'heXllo'
tokenize method
ByteTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
detokenize method
ByteTokenizer.detokenize(inputs)

Transform tokens back into strings.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
get_vocabulary method
ByteTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method
ByteTokenizer.vocabulary_size()

Get the integer size of the tokenizer vocabulary.
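Since the vocabulary is the fixed byte range [0, 256), the size is constant (a quick sketch, assuming the byte-level mapping described above):
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> tokenizer.vocabulary_size()
256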
token_to_id method
ByteTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method
ByteTokenizer.id_to_token(id)

Convert an integer id to a string token.
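A minimal round-trip sketch (assuming ids map to single characters by byte value, per the vocabulary description above):
>>> tokenizer = keras_hub.tokenizers.ByteTokenizer()
>>> tokenizer.token_to_id("h")
104
>>> tokenizer.id_to_token(104)
'h'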