ViTDet backbones


ViTDetBackbone class

keras_cv.models.ViTDetBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

A ViT image encoder that uses a windowed transformer encoder and relative positional encodings.

Arguments

  • input_shape (tuple[int], optional): The size of the input image in (H, W, C) format. Defaults to (1024, 1024, 3).
  • input_tensor (KerasTensor, optional): Output of keras.layers.Input() to use as image input for the model. Defaults to None.
  • include_rescaling (bool, optional): Whether to rescale the inputs. If set to True, inputs will be passed through a Rescaling(1/255.0) layer. Defaults to False.
  • patch_size (int, optional): the patch size to be supplied to the Patching layer to turn input images into a flattened sequence of patches. Defaults to 16.
  • embed_dim (int, optional): The latent dimensionality to be projected into in the output of each stacked windowed transformer encoder. Defaults to 768.
  • depth (int, optional): The number of transformer encoder layers to stack in the Vision Transformer. Defaults to 12.
  • mlp_dim (int, optional): The dimensionality of the hidden Dense layer in the transformer MLP head. Defaults to 768*4.
  • num_heads (int, optional): the number of heads to use in the MultiHeadAttentionWithRelativePE layer of each transformer encoder. Defaults to 12.
  • out_chans (int, optional): The number of channels (features) in the output (image encodings). Defaults to 256.
  • use_bias (bool, optional): Whether to use bias to project the keys, queries, and values in the attention layer. Defaults to True.
  • use_abs_pos (bool, optional): Whether to add absolute positional embeddings to the output patches. Defaults to True.
  • use_rel_pos (bool, optional): Whether to use relative positional encodings in the attention layer. Defaults to True.
  • window_size (int, optional): The size of the window for windowed attention in the transformer encoder blocks. Defaults to 14.
  • global_attention_indices (list, optional): Indices of the blocks that use global attention. Defaults to [2, 5, 8, 11].
  • layer_norm_epsilon (float, optional): The epsilon to use in the layer normalization blocks in the transformer encoder. Defaults to 1e-6.
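
Example

A minimal construction sketch (not taken verbatim from the KerasCV docs): it assumes KerasCV and a Keras backend are installed, and the expected output shape is inferred from the defaults patch_size=16 and out_chans=256 rather than quoted from the documentation.

import numpy as np
import keras_cv

# Build a randomly initialized backbone. include_rescaling=True inserts a
# Rescaling(1/255.0) layer so images in the [0, 255] range can be passed directly.
backbone = keras_cv.models.ViTDetBackbone(include_rescaling=True)

images = np.ones(shape=(1, 1024, 1024, 3))
features = backbone(images)
# With the defaults above (1024x1024 input, patch_size=16, out_chans=256),
# the encoder is expected to produce a feature map of shape (1, 64, 64, 256).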



from_preset method

ViTDetBackbone.from_preset()

Instantiate ViTDetBackbone model from preset config and weights.

Arguments

  • preset: string. Must be one of "vitdet_base", "vitdet_large", "vitdet_huge", "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b". If looking for a preset with pretrained weights, choose one of "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b".
  • load_weights: Whether to load pre-trained weights into model. Defaults to None, which follows whether the preset has pretrained weights available.

Examples

# Load architecture and weights from preset
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
)
# Load randomly initialized model from preset architecture, without weights
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
    load_weights=False,
)

Preset name | Parameters | Description
vitdet_base | 89.67M | Detectron2 ViT backbone with 12 transformer encoders with embed dim 768 and attention layers with 12 heads, with global attention on encoders 2, 5, 8, and 11.
vitdet_large | 308.28M | Detectron2 ViT backbone with 24 transformer encoders with embed dim 1024 and attention layers with 16 heads, with global attention on encoders 5, 11, 17, and 23.
vitdet_huge | 637.03M | Detectron2 ViT backbone with 32 transformer encoders with embed dim 1280 and attention layers with 16 heads, with global attention on encoders 7, 15, 23, and 31.
vitdet_base_sa1b | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_large_sa1b | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_huge_sa1b | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset.
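
The pretrained SA1B presets can also serve as frozen feature extractors for downstream heads. A sketch under the same assumptions as above (KerasCV installed; the output shape is inferred from the defaults, not quoted from the docs):

import numpy as np
import keras_cv

backbone = keras_cv.models.ViTDetBackbone.from_preset("vitdet_base_sa1b")
backbone.trainable = False  # freeze the pretrained SA1B weights

images = np.ones(shape=(1, 1024, 1024, 3))
features = backbone(images)  # expected (1, 64, 64, 256) image encodings with out_chans=256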


ViTDetBBackbone class

keras_cv.models.ViTDetBBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetBBackbone model.


For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))
# Randomly initialized backbone
model = keras_cv.models.ViTDetBBackbone()
output = model(input_data)


ViTDetLBackbone class

keras_cv.models.ViTDetLBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetLBackbone model.


For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))
# Randomly initialized backbone
model = keras_cv.models.ViTDetLBackbone()
output = model(input_data)


ViTDetHBackbone class

keras_cv.models.ViTDetHBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetHBackbone model.


For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))
# Randomly initialized backbone
model = keras_cv.models.ViTDetHBackbone()
output = model(input_data)