KerasCV Models

KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

  • Through the from_preset() constructor, which instantiates an object with a pre-trained configuration and (optionally) weights. Available preset names are listed on this page.
import keras_cv

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
  • Through custom configuration controlled by the user. To do this, simply pass the desired configuration parameters to the default constructors of the symbols documented below.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)

Backbone presets

Each of the following preset names corresponds to a configuration and weights for a backbone model.

The names below can be used with the from_preset() constructor for the corresponding backbone model.

backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

The following table lists the available backbone presets. Presets that ship with pretrained weights note the training dataset in their description.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities rescaled to the range [0, 1] if include_rescaling=False.
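As a minimal sketch of the two input conventions, reusing the ResNetBackbone constructor arguments from the example above (the random array simply stands in for real images):

import numpy as np
import keras_cv

raw_images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")

# include_rescaling=True: the backbone rescales internally, so pass [0, 255].
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=True,
)
features = backbone(raw_images)

# include_rescaling=False: rescale to [0, 1] yourself before calling the model.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
features = backbone(raw_images / 255.0)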

| Preset name | Model | Parameters | Description |
| --- | --- | --- | --- |
| csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_s | CSPDarkNet | 4.22M | CSPDarkNet model with [64, 128, 256, 512] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_m | CSPDarkNet | 12.37M | CSPDarkNet model with [96, 192, 384, 768] channels and [2, 6, 6, 2] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_l | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_xl | CSPDarkNet | 56.84M | CSPDarkNet model with [170, 340, 680, 1360] channels and [4, 12, 12, 4] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| densenet121_imagenet | Unknown | Unknown | DenseNet model with 121 layers. Trained on the ImageNet 2012 classification task. |
| densenet169_imagenet | Unknown | Unknown | DenseNet model with 169 layers. Trained on the ImageNet 2012 classification task. |
| densenet201_imagenet | Unknown | Unknown | DenseNet model with 201 layers. Trained on the ImageNet 2012 classification task. |
| densenet121 | Unknown | Unknown | DenseNet model with 121 layers. |
| densenet169 | Unknown | Unknown | DenseNet model with 169 layers. |
| densenet201 | Unknown | Unknown | DenseNet model with 201 layers. |
| efficientnetlite_b0 |  | 3.41M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetlite_b1 |  | 4.19M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetlite_b2 |  | 4.87M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetlite_b3 |  | 6.99M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetlite_b4 |  | 11.84M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b0 | EfficientNetV1 | 4.05M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv1_b1 | EfficientNetV1 | 6.58M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv1_b2 | EfficientNetV1 | 7.77M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv1_b3 | EfficientNetV1 | 10.79M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetv1_b4 | EfficientNetV1 | 17.68M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b5 | EfficientNetV1 | 28.52M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.6 and depth_coefficient=2.2. |
| efficientnetv1_b6 | EfficientNetV1 | 40.97M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.8 and depth_coefficient=2.6. |
| efficientnetv1_b7 | EfficientNetV1 | 64.11M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=2.0 and depth_coefficient=3.1. |
| efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 77.1% top-1 accuracy and 93.3% top-5 accuracy on ImageNet. |
| efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 79.1% top-1 accuracy and 94.4% top-5 accuracy on ImageNet. |
| efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 80.1% top-1 accuracy and 94.9% top-5 accuracy on ImageNet. |
| efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 83.9% top-1 accuracy and 96.7% top-5 accuracy on ImageNet. |
| efficientnetv2_s | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. |
| efficientnetv2_m | EfficientNetV2 | 53.15M | EfficientNet architecture with 7 convolutional blocks. |
| efficientnetv2_l | EfficientNetV2 | 117.75M | EfficientNet architecture with 7 convolutional blocks, but with more filters than efficientnetv2_m. |
| efficientnetv2_b0 | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv2_b1 | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv2_b2 | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv2_b3 | EfficientNetV2 | 12.93M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set. |
| mit_b0 | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b1 | MiT | 13.16M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b2 | MiT | 24.20M | MiT (MixTransformer) model with 16 transformer blocks. |
| mit_b3 | MiT | 44.08M | MiT (MixTransformer) model with 28 transformer blocks. |
| mit_b4 | MiT | 60.85M | MiT (MixTransformer) model with 41 transformer blocks. |
| mit_b5 | MiT | 81.45M | MiT (MixTransformer) model with 52 transformer blocks. |
| mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
| mobilenet_v3_large | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
| resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on the ImageNet 2012 classification task. |
| resnet18 | ResNetV1 | 11.19M | ResNet model with 18 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet34 | ResNetV1 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50 | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet101 | ResNetV1 | 42.61M | ResNet model with 101 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet152 | ResNetV1 | 58.30M | ResNet model with 152 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on the ImageNet 2012 classification task. |
| resnet18_v2 | ResNetV2 | 11.18M | ResNet model with 18 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet34_v2 | ResNetV2 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet50_v2 | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet101_v2 | ResNetV2 | 42.63M | ResNet model with 101 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet152_v2 | ResNetV2 | 58.33M | ResNet model with 152 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. |
| videoswin_tiny | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. |
| videoswin_small | VideoSwinS | 49.51M | A small Video Swin backbone architecture. |
| videoswin_base | VideoSwinB | 87.64M | A base Video Swin backbone architecture. |
| videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-22K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 82.7% top-1 and 95.5% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-22K dataset and trained on the Kinetics 600 dataset. Published weights are capable of scoring 84.0% top-1 and 96.5% top-5 accuracy on the Kinetics 600 dataset. |
| videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the Kinetics 400 dataset and trained on the Something Something V2 dataset. Published weights are capable of scoring 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
| vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_base | VitDet | 89.67M | Detectron2 ViT backbone with 12 transformer encoders, embed dim 768, and attention layers with 12 heads, with global attention on encoders 2, 5, 8, and 11. |
| vitdet_large | VitDet | 308.28M | Detectron2 ViT backbone with 24 transformer encoders, embed dim 1024, and attention layers with 16 heads, with global attention on encoders 5, 11, 17, and 23. |
| vitdet_huge | VitDet | 637.03M | Detectron2 ViT backbone with 32 transformer encoders, embed dim 1280, and attention layers with 16 heads, with global attention on encoders 7, 15, 23, and 31. |
| yolo_v8_xs_backbone | YOLOV8 | 1.28M | An extra small YOLOV8 backbone. |
| yolo_v8_s_backbone | YOLOV8 | 5.09M | A small YOLOV8 backbone. |
| yolo_v8_m_backbone | YOLOV8 | 11.87M | A medium YOLOV8 backbone. |
| yolo_v8_l_backbone | YOLOV8 | 19.83M | A large YOLOV8 backbone. |
| yolo_v8_xl_backbone | YOLOV8 | 30.97M | An extra large YOLOV8 backbone. |
| yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
| yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
| yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
| center_pillar_waymo_open_dataset | Unknown | 1.28M | An example CenterPillar backbone for the Waymo Open Dataset (WOD). |

Task presets

Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired.

The names below can be used with the from_preset() constructor for the corresponding task models.

object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
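
The returned detector can be used for inference immediately. As a minimal sketch, where the random array stands in for a real batch of [0, 255] images:

import numpy as np

images = np.random.uniform(0, 255, size=(1, 512, 512, 3)).astype("float32")
predictions = object_detector.predict(images)
# Decoded detections come back as a bounding-box dictionary
# (boxes, confidence scores, and class ids) in the "xywh" format requested above.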

Note that all backbone presets are also applicable to the tasks. For example, you can directly use a ResNetBackbone preset with RetinaNet. In this case, fine-tuning is necessary, since the task-specific layers will be randomly initialized.

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
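
A short fine-tuning pass then trains the randomly initialized detection head. The sketch below continues from the model above; train_ds is a hypothetical tf.data.Dataset yielding batches of the form {"images": ..., "bounding_boxes": ...}, and the loss and optimizer choices are illustrative rather than prescribed by the preset:

model.compile(
    classification_loss="focal",  # focal loss for the classification head
    box_loss="smoothl1",          # smooth L1 loss for box regression
    optimizer="adam",
)
model.fit(train_ds, epochs=3)  # train_ds is a hypothetical detection dataset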

For brevity, we do not include the backbone presets in the following table.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities rescaled to the range [0, 1] if include_rescaling=False.

{{task_presets_table}}

API Documentation

Tasks

Backbones