UnicodeCodepointTokenizer
- Original link: https://keras.io/api/keras_nlp/tokenizers/unicode_codepoint_tokenizer/
- Last verified: 2024-11-26
UnicodeCodepointTokenizer class
keras_nlp.tokenizers.UnicodeCodepointTokenizer(
    sequence_length=None,
    lowercase=True,
    normalization_form=None,
    errors="replace",
    replacement_char=65533,
    input_encoding="UTF-8",
    output_encoding="UTF-8",
    vocabulary_size=None,
    dtype="int32",
    **kwargs
)

A unicode character tokenizer layer.
This is a vocabulary-free tokenizer that tokenizes text as unicode character codepoints.
Tokenizer outputs can either be padded and truncated with a
sequence_length argument, or left un-truncated. The exact output will
depend on the rank of the input tensors.
If input is a batch of strings (rank > 0):
By default, the layer will output a tf.RaggedTensor where the last
dimension of the output is ragged. If sequence_length is set, the layer
will output a dense tf.Tensor where all inputs have been padded or
truncated to sequence_length.
If input is a scalar string (rank == 0):
By default, the layer will output a dense tf.Tensor with static shape
[None]. If sequence_length is set, the output will be
a dense tf.Tensor of shape [sequence_length].
The output dtype can be controlled via the dtype argument, which should be
an integer type ("int16", "int32", etc.).
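For instance, a minimal sketch requesting 16-bit token ids (the expected output is shown inline, not verified here):

>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(dtype="int16")
>>> np.array(tokenizer("abc"))
array([97, 98, 99], dtype=int16)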
Arguments
- lowercase: If True, the input text will be lowercased before tokenization. Defaults to True.
- sequence_length: If set, the output will be converted to a dense tensor and padded/trimmed so all outputs are of sequence_length.
- normalization_form: One of the following string values (None, 'NFC', 'NFKC', 'NFD', 'NFKD'). If set, unicode text will be normalized to the given form before tokenizing (see the sketch after this list).
- errors: One of ('replace', 'ignore', 'strict'). Specifies the detokenize() behavior when an invalid codepoint is encountered. A value of 'strict' will cause the tokenizer to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' will cause the tokenizer to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the tokenizer to skip any invalid formatting in the input and produce no corresponding output character.
- replacement_char: The unicode codepoint to use in place of invalid codepoints. Defaults to 65533 (U+FFFD).
- input_encoding: One of ("UTF-8", "UTF-16-BE", or "UTF-32-BE"). The encoding of the input text. Defaults to "UTF-8".
- output_encoding: One of ("UTF-8", "UTF-16-BE", or "UTF-32-BE"). The encoding of the output text. Defaults to "UTF-8".
- vocabulary_size: If set, clamps all codepoints to the range [0, vocabulary_size). Effectively this makes the vocabulary_size - 1 id the OOV value.
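As a sketch of what normalization_form changes (the codepoints shown are what canonical NFC composition should produce, not output verified against the library):

>>> decomposed = "e\u0301"  # "e" + combining acute accent (U+0301)
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> np.array(tokenizer(decomposed))
array([101, 769], dtype=int32)
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     normalization_form="NFC")
>>> np.array(tokenizer(decomposed))
array([233], dtype=int32)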
Examples
Basic usage. (The ids below correspond to the lowercased text, since lowercase defaults to True.)
>>> inputs = "Unicode Tokenizer"
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([117, 110, 105, 99, 111, 100, 101, 32, 116, 111, 107, 101, 110,
105, 122, 101, 114], dtype=int32)
Ragged outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325])
>>> np.array(seq2)
array([1705, 1578, 1575, 1576])
Dense outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=8)
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325, 0, 0], dtype=int32)
>>> np.array(seq2)
array([1705, 1578, 1575, 1576, 0, 0, 0, 0], dtype=int32)
Tokenize, then batch for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
[2346, 2369, 2360, 2381, 2340, 2325],
[1705, 1578, 1575, 1576]]>
Batch, then tokenize for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
[2346, 2369, 2360, 2381, 2340, 2325],
[1705, 1578, 1575, 1576]]>
Tokenize, then batch for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[ 98, 111, 111, 107, 0],
[2346, 2369, 2360, 2381, 2340],
[1705, 1578, 1575, 1576, 0]], dtype=int32)>
Batch, then tokenize for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[ 98, 111, 111, 107, 0],
[2346, 2369, 2360, 2381, 2340],
[1705, 1578, 1575, 1576, 0]], dtype=int32)>
Tokenization with truncation.
>>> inputs = ["I Like to Travel a Lot", "मैं किताबें पढ़ना पसंद करता हूँ"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=5)
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([[ 105, 32, 108, 105, 107],
[2350, 2376, 2306, 32, 2325]], dtype=int32)
Tokenization with vocabulary_size.
>>> latin_ext_cutoff = 592
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... vocabulary_size=latin_ext_cutoff)
>>> outputs = tokenizer("¿Cómo estás?")
>>> np.array(outputs)
array([191, 99, 243, 109, 111, 32, 101, 115, 116, 225, 115, 63],
dtype=int32)
>>> outputs = tokenizer("आप कैसे हैं")
>>> np.array(outputs)
array([591, 591, 32, 591, 591, 591, 591, 32, 591, 591, 591],
dtype=int32)
Detokenization.
>>> inputs = tf.constant([110, 105, 110, 106, 97], dtype="int32")
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> tokenizer.detokenize(inputs)
'ninja'
Detokenization with padding.
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... sequence_length=7)
>>> dataset = tf.data.Dataset.from_tensor_slices(["a b c", "b c", "a"])
>>> dataset = dataset.map(tokenizer)
>>> dataset.take(1).get_single_element()
<tf.Tensor: shape=(7,), dtype=int32,
numpy=array([97, 32, 98, 32, 99, 0, 0], dtype=int32)>
>>> detokunbatched = dataset.map(tokenizer.detokenize)
>>> detokunbatched.take(1).get_single_element()
<tf.Tensor: shape=(), dtype=string, numpy=b'a b c'>
Detokenization with invalid codepoints.
>>> inputs = tf.constant([110, 105, 10000000, 110, 106, 97])
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
... errors="replace", replacement_char=88)
>>> tokenizer.detokenize(inputs)
'niXnja'
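Similarly, errors="ignore" should drop the invalid codepoint entirely rather than substituting a character (a sketch based on the errors argument described above, not verified output):

>>> inputs = tf.constant([110, 105, 10000000, 110, 106, 97])
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     errors="ignore")
>>> tokenizer.detokenize(inputs)
'ninja'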
tokenize method
UnicodeCodepointTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
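Calling the layer directly, as in the examples above, routes through this method, so an explicit tokenize() call should give the same result (a small sketch):

>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> np.array(tokenizer.tokenize("abc"))
array([97, 98, 99], dtype=int32)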
detokenize method
UnicodeCodepointTokenizer.detokenize(inputs)

Transform tokens back into strings.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
get_vocabulary method
UnicodeCodepointTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method
UnicodeCodepointTokenizer.vocabulary_size()

Get the size of the tokenizer vocabulary. None implies no vocabulary size was provided.
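A sketch of the expected behavior, assuming the method simply echoes the constructor argument:

>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> print(tokenizer.vocabulary_size())
None
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     vocabulary_size=1000)
>>> tokenizer.vocabulary_size()
1000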
token_to_id method
UnicodeCodepointTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method
UnicodeCodepointTokenizer.id_to_token(id)

Convert an integer id to a string token.
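Since the tokenizer is vocabulary free, these two conversions presumably reduce to a single character's codepoint, as Python's ord() and chr() would give (a sketch):

>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> tokenizer.token_to_id("a")
97
>>> tokenizer.id_to_token(97)
'a'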