compute_sentence_piece_proto function
- Original Link : https://keras.io/api/keras_hub/tokenizers/compute_sentence_piece_proto/
- Last Checked at : 2024-11-26
compute_sentence_piece_proto
function
keras_hub.tokenizers.compute_sentence_piece_proto(
data, vocabulary_size, model_type="unigram", proto_output_file=None, lowercase=False
)
A utility to train a SentencePiece vocabulary.
Trains a SentencePiece vocabulary from an input dataset or a list of filenames.
If data
is a list of filenames, the file format is required to be plain
text files, and the text will be read in line by line during training.
Arguments
- data: A
tf.data.Dataset
, or a list of filenames. - vocabulary_size: int. The maximum size of a vocabulary to be trained.
- model_type: str. The model algorithm must be one of
"unigram"
,"bpe"
,"word"
or"char"
. Defaults to"unigram"
. - proto_output_file: str. If provided it will be used
as model_file which is passed to model_writer.
If
None
, the model_file will beio.BytesIO
object. Defaults toNone
. - lowercase: bool. If True, the input text will be
lowercased before tokenization. Defaults to
False
.
Returns
A bytes
object with a serialized SentencePiece proto or
None
if proto_output_file if provided.
Examples
Basic Usage (from Dataset).
>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_hub.tokenizers.compute_sentence_piece_proto(inputs, vocabulary_size=15)
>>> tokenizer = keras_hub.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
... print(output)
tf.Tensor([ 4 8 12 5 9 14 5 6 13 4 7 10 11 6 13],
shape=(15,), dtype=int32)
Basic Usage (with files).
with open("test.txt", "w+") as f: f.write("Drifting Along\n")
inputs = ["test.txt"]
proto = keras_hub.tokenizers.compute_sentence_piece_proto(
inputs, vocabulary_size=15, proto_output_file="model.spm")
tokenizer = keras_hub.tokenizers.SentencePieceTokenizer(proto="model.spm")
ds = tf.data.Dataset.from_tensor_slices(["the quick brown fox."])
ds = ds.map(tokenizer)
Usage with lowercase
>>> inputs = tf.data.Dataset.from_tensor_slices(["Drifting Along"])
>>> proto = keras_hub.tokenizers.compute_sentence_piece_proto(
... inputs, vocabulary_size=15, lowercase=True)
>>> tokenizer = keras_hub.tokenizers.SentencePieceTokenizer(proto=proto)
>>> outputs = inputs.map(tokenizer)
>>> for output in outputs:
... print(output)
tf.Tensor([ 4 8 12 5 9 14 5 6 13 4 7 10 11 6 13],
shape=(15,), dtype=int32)