AdditiveAttention layer
- Original Link : https://keras.io/api/layers/attention_layers/additive_attention/
- Last Checked at : 2024-11-25
AdditiveAttention class
keras.layers.AdditiveAttention(use_scale=True, dropout=0.0, **kwargs)
Additive attention layer, a.k.a. Bahdanau-style attention.
Inputs are a list with 2 or 3 elements:
1. A query tensor of shape (batch_size, Tq, dim).
2. A value tensor of shape (batch_size, Tv, dim).
3. An optional key tensor of shape (batch_size, Tv, dim). If none is supplied, value will be used as key.
The calculation follows the steps:
1. Calculate attention scores using query and key with shape (batch_size, Tq, Tv) as a non-linear sum: scores = reduce_sum(tanh(query + key), axis=-1).
2. Use scores to calculate a softmax distribution with shape (batch_size, Tq, Tv).
3. Use the softmax distribution to create a linear combination of value with shape (batch_size, Tq, dim).
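The pseudocode above elides the broadcasting between query and key. A minimal NumPy sketch of steps 1-3 (shapes and values are illustrative, and this is not the layer's actual implementation; the learned scale from use_scale=True is omitted):

```python
import numpy as np

batch_size, Tq, Tv, dim = 2, 3, 4, 8
query = np.random.rand(batch_size, Tq, dim).astype("float32")
value = np.random.rand(batch_size, Tv, dim).astype("float32")
key = value  # when no key is supplied, value is reused as key

# Step 1: non-linear additive scores, shape (batch_size, Tq, Tv).
# Broadcasting expands query to (batch, Tq, 1, dim) and key to (batch, 1, Tv, dim).
scores = np.tanh(query[:, :, None, :] + key[:, None, :, :]).sum(axis=-1)

# Step 2: softmax distribution over the Tv axis, shape (batch_size, Tq, Tv).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 3: linear combination of value, shape (batch_size, Tq, dim).
outputs = weights @ value
print(outputs.shape)  # (2, 3, 8)
```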
Arguments
- use_scale: If True, will create a scalar variable to scale the attention scores.
- dropout: Float between 0 and 1. Fraction of the units to drop for the attention scores. Defaults to 0.0.
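A minimal construction sketch (the argument values here are only illustrative):

```python
import keras

# dropout=0.1 is an example value; the default is 0.0.
attention = keras.layers.AdditiveAttention(use_scale=True, dropout=0.1)
```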
Call arguments
- inputs: List of the following tensors:
  - query: Query tensor of shape (batch_size, Tq, dim).
  - value: Value tensor of shape (batch_size, Tv, dim).
  - key: Optional key tensor of shape (batch_size, Tv, dim). If not given, will use value for both key and value, which is the most common case.
- mask: List of the following tensors:
  - query_mask: A boolean mask tensor of shape (batch_size, Tq). If given, the output will be zero at the positions where mask==False.
  - value_mask: A boolean mask tensor of shape (batch_size, Tv). If given, will apply the mask such that values at positions where mask==False do not contribute to the result.
- return_attention_scores: bool, if True, returns the attention scores (after masking and softmax) as an additional output argument.
- training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout).
- use_causal_mask: Boolean. Set to True for decoder self-attention. Adds a mask such that position i cannot attend to positions j > i. This prevents the flow of information from the future towards the past. Defaults to False.
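A usage sketch of the call arguments; the shapes, random tensors, and padding mask below are illustrative and not from the original page:

```python
import numpy as np
import keras

batch_size, Tq, Tv, dim = 2, 3, 4, 8
query = np.random.rand(batch_size, Tq, dim).astype("float32")
value = np.random.rand(batch_size, Tv, dim).astype("float32")

# Boolean masks; here the last value position is treated as padding.
query_mask = np.ones((batch_size, Tq), dtype="bool")
value_mask = np.ones((batch_size, Tv), dtype="bool")
value_mask[:, -1] = False

layer = keras.layers.AdditiveAttention(dropout=0.1)
outputs = layer(
    [query, value],                 # key omitted, so value is also used as key
    mask=[query_mask, value_mask],
    training=False,                 # inference mode: no dropout applied
)
print(outputs.shape)  # (2, 3, 8) -> (batch_size, Tq, dim)
```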
Output
- Attention outputs of shape (batch_size, Tq, dim).
- (Optional) Attention scores after masking and softmax with shape (batch_size, Tq, Tv).
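A sketch showing both outputs, using causal self-attention (shapes assumed for illustration):

```python
import numpy as np
import keras

seq = np.random.rand(2, 5, 16).astype("float32")  # (batch_size, Tq = Tv, dim)
self_attention = keras.layers.AdditiveAttention()
outputs, scores = self_attention(
    [seq, seq],                    # query and value are the same sequence
    use_causal_mask=True,          # position i cannot attend to positions j > i
    return_attention_scores=True,
)
print(outputs.shape)  # (2, 5, 16) -> (batch_size, Tq, dim)
print(scores.shape)   # (2, 5, 5)  -> (batch_size, Tq, Tv)
```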