WordPieceTokenizer
- Original link: https://keras.io/api/keras_nlp/tokenizers/word_piece_tokenizer/
- Last checked: 2024-11-26
WordPieceTokenizer class
keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=None,
    sequence_length=None,
    lowercase=False,
    strip_accents=False,
    split=True,
    split_on_cjk=True,
    suffix_indicator="##",
    oov_token="[UNK]",
    special_tokens=None,
    special_tokens_in_strings=False,
    dtype="int32",
    **kwargs
)
A WordPiece tokenizer layer.
This layer provides an efficient, in-graph implementation of the WordPiece algorithm used by BERT and other models.
To make this layer more useful out of the box, the layer will pre-tokenize
the input, which will optionally lowercase, strip accents, and split the
input on whitespace and punctuation. None of these pre-tokenization steps is
reversible. The detokenize method will join words with a space, and will
not invert tokenize exactly.
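The pre-tokenization steps described above can be approximated in plain Python. This is only a sketch: the helper name pre_tokenize is hypothetical, it relies on the standard re and unicodedata modules rather than the layer's in-graph ops, and its treatment of edge cases (for example underscores or CJK characters) is an assumption that may not match the layer.

```python
import re
import unicodedata


def pre_tokenize(text, lowercase=False, strip_accents=False):
    """Approximate the optional pre-tokenization steps (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if strip_accents:
        # Decompose characters (NFD), then drop combining marks (category Mn).
        text = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Keep runs of word characters together; emit every other
    # non-whitespace character (punctuation) as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


words = pre_tokenize("The quick, Café!", lowercase=True, strip_accents=True)
print(words)            # ['the', 'quick', ',', 'cafe', '!']
print(" ".join(words))  # the quick , cafe !
```

Joining the words with spaces does not restore the original casing, accents, or spacing, which is why the steps are not reversible.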
If a more custom pre-tokenization step is desired, the layer can be
configured to apply only the strict WordPiece algorithm by passing
lowercase=False, strip_accents=False and split=False. In this case, inputs
should be pre-split string tensors or ragged tensors.
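For intuition about what the strict WordPiece step does to one pre-split word, here is a minimal pure-Python sketch of the greedy longest-match-first strategy commonly described for BERT. The function name wordpiece is hypothetical, and the layer's in-graph implementation may differ in details such as limits on word length.

```python
def wordpiece(word, vocab, suffix_indicator="##", oov_token="[UNK]"):
    """Greedy longest-match-first split of one pre-split word (sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the OOV token.
            return [oov_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."}
print(wordpiece("quick", vocab))  # ['qu', '##ick']
print(wordpiece("brown", vocab))  # ['br', '##own']
print(wordpiece("quiz", vocab))   # ['[UNK]']
```

In this sketch a word that cannot be fully covered by vocabulary pieces becomes a single oov_token rather than a partial split.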
Tokenizer outputs can either be padded and truncated with a sequence_length
argument, or left un-truncated. The exact output will depend on the rank of
the input tensors.
If input is a batch of strings (rank > 0):
By default, the layer will output a tf.RaggedTensor where the last
dimension of the output is ragged. If sequence_length is set, the layer
will output a dense tf.Tensor where all inputs have been padded or
truncated to sequence_length.
If input is a scalar string (rank == 0):
By default, the layer will output a dense tf.Tensor with static shape
[None]. If sequence_length is set, the output will be a dense tf.Tensor of
shape [sequence_length].
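The effect of sequence_length on a batch can be sketched with plain Python lists standing in for ragged and dense tensors. The helper to_dense is hypothetical, and the padding id of 0 is an assumption chosen to match the zeros in the dense example output on this page.

```python
def to_dense(ragged_ids, sequence_length, pad_id=0):
    """Pad or truncate each row of token ids to a fixed length (sketch)."""
    dense = []
    for row in ragged_ids:
        row = row[:sequence_length]  # truncate rows that are too long
        row = row + [pad_id] * (sequence_length - len(row))  # pad short rows
        dense.append(row)
    return dense


# Rows of different lengths, as a ragged output would hold them.
ragged = [[1, 2, 3, 4, 5, 6, 7], [1, 6]]
print(to_dense(ragged, sequence_length=10))
# [[1, 2, 3, 4, 5, 6, 7, 0, 0, 0], [1, 6, 0, 0, 0, 0, 0, 0, 0, 0]]
print(to_dense(ragged, sequence_length=4))
# [[1, 2, 3, 4], [1, 6, 0, 0]]
```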
The output dtype can be controlled via the dtype argument, which should
be either an integer or string type.
Arguments
- vocabulary: A list of strings or a string filename path. If passing a list, each element of the list should be a single WordPiece token string. If passing a filename, the file should be a plain text file containing a single WordPiece token per line.
- sequence_length: int. If set, the output will be converted to a dense tensor and padded/trimmed so all outputs are of sequence_length.
- lowercase: bool. If True, the input text will be lowercased before tokenization. Defaults to False.
- strip_accents: bool. If True, all accent marks will be removed from text before tokenization. Defaults to False.
- split: bool. If True, input will be split on whitespace and punctuation marks, and all punctuation marks will be kept as tokens. If False, input should be split ("pre-tokenized") before calling the tokenizer, and passed as a dense or ragged tensor of whole words. Defaults to True.
- split_on_cjk: bool. If True, input will be split on CJK characters, i.e., Chinese, Japanese, Korean and Vietnamese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)). Note that this is applicable only when split is True. Defaults to True.
- suffix_indicator: str. The characters prepended to a WordPiece to indicate that it is a suffix to another subword. E.g. "##ing". Defaults to "##".
- oov_token: str. The string value to substitute for an unknown token. It must be included in the vocab. Defaults to "[UNK]".
- special_tokens_in_strings: bool. A bool to indicate if the tokenizer should expect special tokens in input strings that should be tokenized and mapped correctly to their ids. Defaults to False.
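When vocabulary is given as a filename, the file holds one WordPiece token per line. The sketch below writes such a file and reads it back using only the standard library. The file name vocab.txt and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]

# Write one WordPiece token per line (hypothetical file name).
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# Read the file back to confirm it round-trips to the same token list.
with open(vocab_path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

print(loaded == vocab)     # True
print("[UNK]" in loaded)   # True: the oov_token must be in the vocabulary
```

A path like vocab_path could then be passed as the vocabulary argument in place of the list.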
Examples
>>> import numpy as np
>>> import tensorflow as tf
>>> import keras_hub
Unpadded outputs (no sequence_length set).
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... )
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([1, 2, 3, 4, 5, 6, 7], dtype=int32)
Dense outputs.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = ["The quick brown fox."]
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... sequence_length=10,
... lowercase=True,
... )
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([[1, 2, 3, 4, 5, 6, 7, 0, 0, 0]], dtype=int32)
String output.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... dtype="string",
... )
>>> tokenizer(inputs)
['the', 'qu', '##ick', 'br', '##own', 'fox', '.']
Detokenization.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... lowercase=True,
... )
>>> tokenizer.detokenize(tokenizer.tokenize(inputs))
'the quick brown fox .'
Custom splitting.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The$quick$brown$fox"
>>> tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
... vocabulary=vocab,
... split=False,
... lowercase=True,
... dtype='string',
... )
>>> split_inputs = tf.strings.split(inputs, sep="$")
>>> tokenizer(split_inputs)
['the', 'qu', '##ick', 'br', '##own', 'fox']
tokenize method
WordPieceTokenizer.tokenize(inputs)
Transform input tensors of strings into output tokens.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
detokenize method
WordPieceTokenizer.detokenize(inputs)
Transform tokens back into strings.
Arguments
- inputs: Input tensor, or dict/list/tuple of input tensors.
- *args: Additional positional arguments.
- **kwargs: Additional keyword arguments.
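To illustrate why detokenize joins words with a space and does not invert tokenize exactly, here is a pure-Python sketch that operates on string tokens. It is an approximation of the behavior shown in the Detokenization example on this page, not the layer's implementation, and the standalone function name detokenize is hypothetical.

```python
def detokenize(tokens, suffix_indicator="##"):
    """Merge suffix pieces into words, then join words with spaces (sketch)."""
    words = []
    for token in tokens:
        if token.startswith(suffix_indicator) and words:
            # A suffix piece continues the previous word.
            words[-1] += token[len(suffix_indicator):]
        else:
            words.append(token)
    return " ".join(words)


tokens = ["the", "qu", "##ick", "br", "##own", "fox", "."]
print(detokenize(tokens))  # the quick brown fox .
```

The space before the final period and the lost capital letter show the information that pre-tokenization discards.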
get_vocabulary method
WordPieceTokenizer.get_vocabulary()
Get the tokenizer vocabulary as a list of string tokens.
vocabulary_size method
WordPieceTokenizer.vocabulary_size()
Get the integer size of the tokenizer vocabulary.
token_to_id method
WordPieceTokenizer.token_to_id(token)
Convert a string token to an integer id.
id_to_token method
WordPieceTokenizer.id_to_token(id)
Convert an integer id to a string token.
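The vocabulary accessors can be pictured as lookups over the vocabulary list. The sketch below assumes that a token's id is its position in that list, which the examples on this page suggest (for instance "the" maps to 1). The plain list and dict here are stand-ins, not the layer's internal data structures.

```python
vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]

# Assumed mapping: each token's id is its index in the vocabulary list.
token_to_id = {token: index for index, token in enumerate(vocab)}

print(len(vocab))          # 8, analogous to vocabulary_size()
print(token_to_id["the"])  # 1, analogous to token_to_id("the")
print(vocab[3])            # ##ick, analogous to id_to_token(3)
```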