MaskedLMMaskGenerator layer
- Original Link : https://keras.io/api/keras_nlp/preprocessing_layers/masked_lm_mask_generator/
- Last Checked at : 2024-11-26
MaskedLMMaskGenerator class
keras_nlp.layers.MaskedLMMaskGenerator(
    vocabulary_size,
    mask_selection_rate,
    mask_token_id,
    mask_selection_length=None,
    unselectable_token_ids=[0],
    mask_token_rate=0.8,
    random_token_rate=0.1,
    **kwargs
)
Layer that applies language model masking.
This layer is useful for preparing inputs for masked language modeling (MaskedLM) tasks. It follows the masking strategy described in the original BERT paper. Given tokenized text, it randomly selects a certain number of tokens for masking. Each selected token is then, with configurable probabilities, replaced by the mask token, replaced by a random token, or left unchanged.
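The select-then-replace strategy described above can be sketched in plain Python. This is a minimal illustration, not the layer's implementation: the function name `mask_tokens` is hypothetical, it handles a single unbatched sequence, and it assumes each eligible token is selected by an independent coin flip, which may differ from how the layer itself picks tokens.

```python
import random


def mask_tokens(token_ids, vocabulary_size, mask_selection_rate, mask_token_id,
                unselectable_token_ids=(0,), mask_token_rate=0.8,
                random_token_rate=0.1, rng=None):
    """Hypothetical sketch of BERT-style masking for one sequence."""
    rng = rng or random.Random()
    masked = list(token_ids)
    mask_positions, mask_ids = [], []
    for pos, tok in enumerate(token_ids):
        # Padding and other special tokens are never eligible.
        if tok in unselectable_token_ids:
            continue
        # Assumption: independent selection with probability mask_selection_rate.
        if rng.random() >= mask_selection_rate:
            continue
        mask_positions.append(pos)
        mask_ids.append(tok)  # remember the original id as the label
        roll = rng.random()
        if roll < mask_token_rate:
            masked[pos] = mask_token_id  # replace with the mask token
        elif roll < mask_token_rate + random_token_rate:
            masked[pos] = rng.randrange(vocabulary_size)  # replace with a random token
        # Otherwise the selected token is left unchanged.
    return {
        "token_ids": masked,
        "mask_positions": mask_positions,
        "mask_ids": mask_ids,
    }
```

Passing a seeded `random.Random` as `rng` makes the sketch reproducible; leaving it unset gives a fresh mask on every call.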
Input data should be passed as tensors, tf.RaggedTensors, or lists. For batched input, inputs should be a list of lists or a rank two tensor. For unbatched inputs, each element should be a list or a rank one tensor.
This layer can be used with tf.data to generate dynamic masks on the fly during training.
Arguments
- vocabulary_size: int. The size of the vocabulary.
- mask_selection_rate: float. The probability that a token is selected for masking.
- mask_token_id: int. The id of the mask token.
- mask_selection_length: int. Maximum number of tokens selected for masking in each sequence. If set, the outputs mask_positions, mask_ids and mask_weights will be padded to dense tensors of length mask_selection_length; otherwise the outputs will be RaggedTensors. Defaults to None.
- unselectable_token_ids: A list of token ids that should not be considered eligible for masking. By default, we assume 0 corresponds to a padding token and ignore it. Defaults to [0].
- mask_token_rate: float. Must be between 0 and 1. Indicates how often the mask token is substituted for tokens selected for masking. Defaults to 0.8.
- random_token_rate: float. Must be between 0 and 1. Indicates how often a random token is substituted for tokens selected for masking. Note: mask_token_rate + random_token_rate <= 1, and with probability (1 - mask_token_rate - random_token_rate) a selected token is left unchanged. Defaults to 0.1.
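To make the interaction between the three rates concrete, the following sketch computes the expected fraction of eligible tokens that ends up in each outcome. The helper name `masking_outcome_rates` is hypothetical, and the arithmetic assumes selection is not capped by mask_selection_length.

```python
def masking_outcome_rates(mask_selection_rate, mask_token_rate=0.8,
                          random_token_rate=0.1):
    """Hypothetical helper: expected outcome fractions per eligible token."""
    for name, rate in [("mask_selection_rate", mask_selection_rate),
                       ("mask_token_rate", mask_token_rate),
                       ("random_token_rate", random_token_rate)]:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    if mask_token_rate + random_token_rate > 1.0:
        raise ValueError("mask_token_rate + random_token_rate must be <= 1")
    # Whatever is left over is the chance a selected token stays unchanged.
    keep_rate = 1.0 - mask_token_rate - random_token_rate
    return {
        "replaced_by_mask": mask_selection_rate * mask_token_rate,
        "replaced_by_random": mask_selection_rate * random_token_rate,
        "selected_but_unchanged": mask_selection_rate * keep_rate,
        "not_selected": 1.0 - mask_selection_rate,
    }


# Illustrative values only: a 0.15 selection rate with the default 0.8 / 0.1 split.
rates = masking_outcome_rates(0.15)
```

With these illustrative values, roughly 12% of eligible tokens become the mask token, about 1.5% become a random token, and about 1.5% are selected but left as they are.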
Returns
- A Dict with 4 keys:
  - token_ids: Tensor or RaggedTensor, with the same type and shape as the input. The sequence after masking.
  - mask_positions: Tensor, or RaggedTensor if mask_selection_length is None. The positions in token_ids that were masked.
  - mask_ids: Tensor, or RaggedTensor if mask_selection_length is None. The original token ids at the masked positions.
  - mask_weights: Tensor, or RaggedTensor if mask_selection_length is None. mask_weights has the same shape as mask_positions and mask_ids. Each element in mask_weights is 0 or 1: 1 means the corresponding position in mask_positions is an actual mask, 0 means it is padding.
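The relationship between mask_positions, mask_ids and mask_weights when mask_selection_length is set can be sketched as follows. Both helper names are hypothetical, and the pad value of 0 for positions and ids is an assumption made for illustration. The second helper shows one common use of mask_weights: excluding padded slots from an averaged loss.

```python
def pad_mask_outputs(mask_positions, mask_ids, mask_selection_length):
    """Hypothetical sketch: pad per-sequence mask outputs to a fixed length."""
    n = min(len(mask_positions), mask_selection_length)
    pad = mask_selection_length - n
    return {
        # Assumption: padded slots are filled with 0.
        "mask_positions": list(mask_positions[:n]) + [0] * pad,
        "mask_ids": list(mask_ids[:n]) + [0] * pad,
        # 1 marks an actual mask, 0 marks padding.
        "mask_weights": [1] * n + [0] * pad,
    }


def weighted_mean_loss(per_position_loss, mask_weights):
    """Average the loss over real masks only, ignoring padded slots."""
    total_weight = sum(mask_weights)
    if total_weight == 0:
        return 0.0
    weighted = sum(l * w for l, w in zip(per_position_loss, mask_weights))
    return weighted / total_weight


padded = pad_mask_outputs([1, 3], [4, 6], mask_selection_length=5)
loss = weighted_mean_loss([2.0, 4.0, 9.0, 9.0, 9.0], padded["mask_weights"])
```

Here only the first two slots carry weight 1, so the three padded loss values do not affect the average.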
Examples
Basic usage.
import keras_hub

masker = keras_hub.layers.MaskedLMMaskGenerator(
    vocabulary_size=10,
    mask_selection_rate=0.2,
    mask_token_id=0,
    mask_selection_length=5,
)
# Dense input.
masker([1, 2, 3, 4, 5])
# Ragged input.
masker([[1, 2], [1, 2, 3, 4]])
Masking a batch that contains special tokens.
pad_id, cls_id, sep_id, mask_id = 0, 1, 2, 3
batch = [
    [cls_id, 4, 5, 6, sep_id, 7, 8, sep_id, pad_id, pad_id],
    [cls_id, 4, 5, sep_id, 6, 7, 8, 9, sep_id, pad_id],
]
masker = keras_hub.layers.MaskedLMMaskGenerator(
    vocabulary_size=10,
    mask_selection_rate=0.2,
    mask_selection_length=5,
    mask_token_id=mask_id,
    unselectable_token_ids=[
        cls_id,
        sep_id,
        pad_id,
    ],
)
masker(batch)