TransformerDecoder layer
- Original Link : https://keras.io/api/keras_nlp/modeling_layers/transformer_decoder/
- Last Checked at : 2024-11-26
TransformerDecoder class

keras_nlp.layers.TransformerDecoder(
    intermediate_dim,
    num_heads,
    dropout=0,
    activation="relu",
    layer_norm_epsilon=1e-05,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    normalize_first=False,
    **kwargs
)
Transformer decoder.
This class follows the architecture of the transformer decoder layer in the paper Attention is All You Need. Users can instantiate multiple instances of this class to stack up a decoder.
By default, this layer will apply a causal mask to the decoder attention layer. You can also pass padding or attention masks directly to the layer during call, e.g. with decoder_padding_mask or decoder_attention_mask.
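For example, a padding mask can be supplied when calling the layer. The snippet below is a minimal sketch (not from the original page); the layer sizes, input shapes, and mask values are illustrative only.

import numpy as np
import keras_nlp

decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=64, num_heads=4)
# Batch of 2 sequences, 10 positions, feature size 32 (illustrative).
x = np.random.uniform(size=(2, 10, 32)).astype("float32")
# Boolean padding mask: True for real tokens, False for padded positions.
padding_mask = np.ones((2, 10), dtype="bool")
padding_mask[:, 8:] = False
y = decoder(x, decoder_padding_mask=padding_mask)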
This layer can be called with either one or two inputs. The number of inputs must be consistent across all calls. The options are as follows:
- layer(decoder_sequence): no cross-attention will be built into the decoder block. This is useful when building a "decoder-only" transformer such as GPT-2.
- layer(decoder_sequence, encoder_sequence): cross-attention will be built into the decoder block. This is useful when building an "encoder-decoder" transformer, such as the original transformer model described in Attention is All You Need.
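Both patterns can be sketched as follows (illustrative shapes, not from the original page); note that separate layer instances are used because the number of inputs must stay consistent across calls.

import numpy as np
import keras_nlp

decoder_only = keras_nlp.layers.TransformerDecoder(intermediate_dim=64, num_heads=4)
encoder_decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=64, num_heads=4)

decoder_seq = np.random.uniform(size=(2, 10, 32)).astype("float32")
encoder_seq = np.random.uniform(size=(2, 12, 32)).astype("float32")

# Decoder-only: causal self-attention, no cross-attention is built.
gpt_style_output = decoder_only(decoder_seq)

# Encoder-decoder: cross-attention over the encoder sequence is built.
seq2seq_output = encoder_decoder(decoder_seq, encoder_seq)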
Arguments
- intermediate_dim: int, the hidden size of the feedforward network.
- num_heads: int, the number of heads in MultiHeadAttention.
- dropout: float. The dropout value, shared by MultiHeadAttention and the feedforward network. Defaults to 0.
- activation: string or keras.activations. The activation function of the feedforward network. Defaults to "relu".
- layer_norm_epsilon: float. The eps value in layer normalization components. Defaults to 1e-5.
- kernel_initializer: string or keras.initializers initializer. The kernel initializer for the dense and multiheaded attention layers. Defaults to "glorot_uniform".
- bias_initializer: string or keras.initializers initializer. The bias initializer for the dense and multiheaded attention layers. Defaults to "zeros".
- normalize_first: bool. If True, the inputs to the attention layer(s) and the intermediate dense layer are normalized (similar to GPT-2). If set to False, outputs of the attention layer and intermediate dense layer are normalized (similar to BERT). Defaults to False.
- **kwargs: other keyword arguments passed to keras.layers.Layer, including name, trainable, dtype etc.
Example
import numpy as np
import keras
import keras_nlp

# Create a single transformer decoder layer.
decoder = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=64, num_heads=8)

# Create a simple model containing the decoder.
decoder_input = keras.Input(shape=(10, 64))
encoder_input = keras.Input(shape=(10, 64))
output = decoder(decoder_input, encoder_input)
model = keras.Model(
    inputs=(decoder_input, encoder_input),
    outputs=output,
)

# Call decoder on the inputs.
decoder_input_data = np.random.uniform(size=(2, 10, 64))
encoder_input_data = np.random.uniform(size=(2, 10, 64))
decoder_output = model((decoder_input_data, encoder_input_data))
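As an additional, hypothetical configuration (not from the original page), the constructor arguments described above can be combined to build a pre-normalization, GPT-2 style block with a custom initializer:

import keras
import keras_nlp

decoder = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=256,
    num_heads=8,
    dropout=0.1,
    activation="gelu",
    normalize_first=True,
    kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
)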
References
- Vaswani et al., 2017. Attention Is All You Need. https://arxiv.org/abs/1706.03762
call method

TransformerDecoder.call(
    decoder_sequence,
    encoder_sequence=None,
    decoder_padding_mask=None,
    decoder_attention_mask=None,
    encoder_padding_mask=None,
    encoder_attention_mask=None,
    self_attention_cache=None,
    self_attention_cache_update_index=None,
    cross_attention_cache=None,
    cross_attention_cache_update_index=None,
    use_causal_mask=True,
    training=None,
)
Forward pass of the TransformerDecoder.
Arguments
- decoder_sequence: a Tensor. The decoder input sequence.
- encoder_sequence: a Tensor. The encoder input sequence. For decoder-only models (like GPT2), this should be left None. Once the model is called once without an encoder_sequence, you cannot call it again with encoder_sequence.
- decoder_padding_mask: a boolean Tensor, the padding mask of the decoder sequence, must be of shape [batch_size, decoder_sequence_length].
- decoder_attention_mask: a boolean Tensor. Customized decoder sequence mask, must be of shape [batch_size, decoder_sequence_length, decoder_sequence_length].
- encoder_padding_mask: a boolean Tensor, the padding mask of the encoder sequence, must be of shape [batch_size, encoder_sequence_length].
- encoder_attention_mask: a boolean Tensor. Customized encoder sequence mask, must be of shape [batch_size, encoder_sequence_length, encoder_sequence_length].
- self_attention_cache: a dense float Tensor. The cache of key/value pairs in the self-attention layer. Has shape [batch_size, 2, max_seq_len, num_heads, key_dims].
- self_attention_cache_update_index: an int or int Tensor, the index at which to update the self_attention_cache. Usually, this is the index of the current token being processed during decoding.
- cross_attention_cache: a dense float Tensor. The cache of key/value pairs in the cross-attention layer. Has shape [batch_size, 2, S, num_heads, key_dims].
- cross_attention_cache_update_index: an int or int Tensor, the index at which to update the cross_attention_cache. Usually, this is either 0 (compute the entire cross_attention_cache), or None (reuse a previously computed cross_attention_cache).
- use_causal_mask: bool, defaults to True. If true, a causal mask (masking out future input) is applied on the decoder sequence.
- training: a boolean indicating whether the layer should behave in training mode or in inference mode.
Returns
One of three things, depending on call arguments:
- outputs, if self_attention_cache is None.
- (outputs, self_attention_cache), if self_attention_cache is set and the layer has no cross-attention.
- (outputs, self_attention_cache, cross_attention_cache), if self_attention_cache and cross_attention_cache are set and the layer has cross-attention.
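As a usage sketch of the caching arguments and the tuple return value, the loop below decodes one position at a time with a self-attention cache. This is not taken from the original page: the per-head size of the cache is assumed to be the input feature size divided by num_heads, and all shapes and values are illustrative.

import numpy as np
import keras_nlp

batch_size, max_len, hidden_dim, num_heads = 2, 10, 64, 8
head_dim = hidden_dim // num_heads  # assumed key_dims of the cache

decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=128, num_heads=num_heads)
# Build the layer for decoder-only usage before passing a cache.
decoder.build((batch_size, max_len, hidden_dim))

# Empty key/value cache of shape [batch_size, 2, max_seq_len, num_heads, key_dims].
cache = np.zeros((batch_size, 2, max_len, num_heads, head_dim), dtype="float32")
# Stand-in for already-embedded tokens.
tokens = np.random.uniform(size=(batch_size, max_len, hidden_dim)).astype("float32")

for i in range(max_len):
    # Feed a single position and update the cache at index i.
    step = tokens[:, i : i + 1, :]
    # With a self-attention cache and no cross-attention, the layer returns
    # (outputs, self_attention_cache), matching the Returns section above.
    outputs, cache = decoder(
        step,
        self_attention_cache=cache,
        self_attention_cache_update_index=i,
    )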