ViTDet backbones
- Original link: https://keras.io/api/keras_cv/models/backbones/vitdet/
- Last checked: 2024-11-25
ViTDetBackbone class
keras_cv.models.ViTDetBackbone(
include_rescaling,
input_shape=(1024, 1024, 3),
input_tensor=None,
patch_size=16,
embed_dim=768,
depth=12,
mlp_dim=3072,
num_heads=12,
out_chans=256,
use_bias=True,
use_abs_pos=True,
use_rel_pos=True,
window_size=14,
global_attention_indices=[2, 5, 8, 11],
layer_norm_epsilon=1e-06,
**kwargs
)

A ViT image encoder that uses a windowed transformer encoder and relative positional encodings.
Arguments
- input_shape (tuple[int], optional): The size of the input image in (H, W, C) format. Defaults to (1024, 1024, 3).
- input_tensor (KerasTensor, optional): Output of keras.layers.Input() to use as image input for the model. Defaults to None.
- include_rescaling (bool, optional): Whether to rescale the inputs. If set to True, inputs will be passed through a Rescaling(1/255.0) layer. Defaults to False.
- patch_size (int, optional): The patch size supplied to the Patching layer to turn input images into a flattened sequence of patches. Defaults to 16.
- embed_dim (int, optional): The latent dimensionality projected to in the output of each stacked windowed transformer encoder. Defaults to 768.
- depth (int, optional): The number of transformer encoder layers to stack in the Vision Transformer. Defaults to 12.
- mlp_dim (int, optional): The dimensionality of the hidden Dense layer in the transformer MLP head. Defaults to 768*4.
- num_heads (int, optional): The number of heads to use in the MultiHeadAttentionWithRelativePE layer of each transformer encoder. Defaults to 12.
- out_chans (int, optional): The number of channels (features) in the output (image encodings). Defaults to 256.
- use_bias (bool, optional): Whether to use bias to project the keys, queries, and values in the attention layer. Defaults to True.
- use_abs_pos (bool, optional): Whether to add absolute positional embeddings to the output patches. Defaults to True.
- use_rel_pos (bool, optional): Whether to use relative positional encodings in the attention layer. Defaults to True.
- window_size (int, optional): The size of the window for windowed attention in the transformer encoder blocks. Defaults to 14.
- global_attention_indices (list, optional): Indices of the blocks that use global attention. Defaults to [2, 5, 8, 11].
- layer_norm_epsilon (int, optional): The epsilon to use in the layer normalization blocks of the transformer encoder. Defaults to 1e-6.
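For a concrete sense of how these arguments fit together, here is a minimal sketch (not an official recipe) that spells out the documented defaults and runs one dummy image through the backbone. The printed shape assumes the default 1024x1024 input with patch_size=16 and out_chans=256, i.e. a 64x64 grid of 256-channel features; verify it on your installation.

import numpy as np
import keras_cv

# Minimal sketch: build the backbone directly from the constructor.
# Every value below restates a documented default, so this is equivalent
# to keras_cv.models.ViTDetBackbone(include_rescaling=True).
backbone = keras_cv.models.ViTDetBackbone(
    include_rescaling=True,  # pass raw 0-255 pixels through Rescaling(1/255.0)
    input_shape=(1024, 1024, 3),
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=768 * 4,
    num_heads=12,
    out_chans=256,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
)

images = np.ones((1, 1024, 1024, 3), dtype="float32")  # one dummy image, batched
features = backbone(images)
print(features.shape)  # expected: (1, 64, 64, 256) with the defaults above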
from_preset method
ViTDetBackbone.from_preset()

Instantiate ViTDetBackbone model from preset config and weights.
Arguments
- preset: string. Must be one of "vitdet_base", "vitdet_large", "vitdet_huge", "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b". If looking for a preset with pretrained weights, choose one of "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b".
- load_weights: Whether to load pre-trained weights into the model. Defaults to None, which follows whether the preset has pretrained weights available.
Examples
# Load architecture and weights from preset
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
)

# Load randomly initialized model from preset architecture, without weights
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
    load_weights=False,
)

| Preset name | Parameters | Description |
|---|---|---|
| vitdet_base | 89.67M | Detectron2 ViT backbone with 12 transformer encoders with embed dim 768 and attention layers with 12 heads with global attention on encoders 2, 5, 8, and 11. |
| vitdet_large | 308.28M | Detectron2 ViT backbone with 24 transformer encoders with embed dim 1024 and attention layers with 16 heads with global attention on encoders 5, 11, 17, and 23. |
| vitdet_huge | 637.03M | Detectron2 ViT backbone with 32 transformer encoders with embed dim 1280 and attention layers with 16 heads with global attention on encoders 7, 15, 23, and 31. |
| vitdet_base_sa1b | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_large_sa1b | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_huge_sa1b | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
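To make the preset table above concrete, the following sketch loads one of the pretrained SA1B presets and uses it as a frozen feature extractor. Treat it as an illustration under the documented defaults (the preset download requires network access), not a canonical recipe; the exact output shape depends on the preset's input size, patch size, and out_chans.

import numpy as np
import keras_cv

# Sketch: a pretrained preset used as a frozen feature extractor.
backbone = keras_cv.models.ViTDetBackbone.from_preset("vitdet_base_sa1b")
backbone.trainable = False  # keep the SA1B-pretrained weights fixed

images = np.random.uniform(0, 255, size=(2, 1024, 1024, 3)).astype("float32")
features = backbone(images)
print(features.shape)  # expected: (2, 64, 64, 256) for the base preset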
ViTDetBBackbone class
keras_cv.models.ViTDetBBackbone(
include_rescaling,
input_shape=(1024, 1024, 3),
input_tensor=None,
patch_size=16,
embed_dim=768,
depth=12,
mlp_dim=3072,
num_heads=12,
out_chans=256,
use_bias=True,
use_abs_pos=True,
use_rel_pos=True,
window_size=14,
global_attention_indices=[2, 5, 8, 11],
layer_norm_epsilon=1e-06,
**kwargs
)

ViTDetBBackbone model.
For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.
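As a rough sketch of the workflow that guide covers, a frozen ViTDet backbone can be wrapped in a small Keras model with a task-specific head on top. The pooling-plus-Dense head and the 10-class output below are arbitrary placeholders, not part of the KerasCV API; in practice you would start from a pretrained preset such as "vitdet_base_sa1b" rather than a randomly initialized backbone.

import keras
import keras_cv

# Sketch: freeze the backbone and train only a small placeholder head.
backbone = keras_cv.models.ViTDetBBackbone()  # swap in a pretrained preset for real use
backbone.trainable = False

inputs = keras.Input(shape=(1024, 1024, 3))
x = backbone(inputs)                              # spatial feature map from the backbone
x = keras.layers.GlobalAveragePooling2D()(x)      # collapse the spatial grid
outputs = keras.layers.Dense(10, activation="softmax")(x)  # 10 classes is a placeholder

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()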
Example
import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetBBackbone()
output = model(input_data)

ViTDetLBackbone class
keras_cv.models.ViTDetLBackbone(
include_rescaling,
input_shape=(1024, 1024, 3),
input_tensor=None,
patch_size=16,
embed_dim=768,
depth=12,
mlp_dim=3072,
num_heads=12,
out_chans=256,
use_bias=True,
use_abs_pos=True,
use_rel_pos=True,
window_size=14,
global_attention_indices=[2, 5, 8, 11],
layer_norm_epsilon=1e-06,
**kwargs
)

ViTDetLBackbone model.
For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.
Example
import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetLBackbone()
output = model(input_data)

ViTDetHBackbone class
keras_cv.models.ViTDetHBackbone(
include_rescaling,
input_shape=(1024, 1024, 3),
input_tensor=None,
patch_size=16,
embed_dim=768,
depth=12,
mlp_dim=3072,
num_heads=12,
out_chans=256,
use_bias=True,
use_abs_pos=True,
use_rel_pos=True,
window_size=14,
global_attention_indices=[2, 5, 8, 11],
layer_norm_epsilon=1e-06,
**kwargs
)

ViTDetHBackbone model.
For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.
Example
import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetHBackbone()
output = model(input_data)