KerasCV Models

KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

  • Through the from_preset() constructor, which instantiates an object with a pre-trained configuration and (optionally) weights. Available preset names are listed on this page.
import keras_cv

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
  • Through custom configuration controlled by the user. To do this, simply pass the desired configuration parameters to the default constructors of the symbols documented below.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)

Backbone presets

Each of the following preset names corresponds to a configuration and weights for a backbone model.

The names below can be used with the from_preset() constructor for the corresponding backbone model.

backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

The following table lists the available backbone presets. Presets that ship with pretrained weights note the training dataset in their description.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities rescaled to the range [0, 1] if include_rescaling=False.
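As a minimal sketch of the two input conventions, reusing the ResNetBackbone constructor arguments from the example above (the random array simply stands in for real images):

import numpy as np
import keras_cv

raw_images = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")

# include_rescaling=True: the backbone rescales internally, so pass [0, 255].
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=True,
)
features = backbone(raw_images)

# include_rescaling=False: rescale to [0, 1] yourself before calling the model.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
features = backbone(raw_images / 255.0)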

| Preset name | Model | Parameters | Description |
| --- | --- | --- | --- |
| csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_s | CSPDarkNet | 4.22M | CSPDarkNet model with [64, 128, 256, 512] channels and [1, 3, 3, 1] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_m | CSPDarkNet | 12.37M | CSPDarkNet model with [96, 192, 384, 768] channels and [2, 6, 6, 2] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_l | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_xl | CSPDarkNet | 56.84M | CSPDarkNet model with [170, 340, 680, 1360] channels and [4, 12, 12, 4] depths where the batch normalization and SiLU activation are applied after the convolution layers. |
| densenet121_imagenet | Unknown | Unknown | DenseNet model with 121 layers. Trained on the ImageNet 2012 classification task. |
| densenet169_imagenet | Unknown | Unknown | DenseNet model with 169 layers. Trained on the ImageNet 2012 classification task. |
| densenet201_imagenet | Unknown | Unknown | DenseNet model with 201 layers. Trained on the ImageNet 2012 classification task. |
| densenet121 | Unknown | Unknown | DenseNet model with 121 layers. |
| densenet169 | Unknown | Unknown | DenseNet model with 169 layers. |
| densenet201 | Unknown | Unknown | DenseNet model with 201 layers. |
| efficientnetlite_b0 |  | 3.41M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetlite_b1 |  | 4.19M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetlite_b2 |  | 4.87M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetlite_b3 |  | 6.99M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetlite_b4 |  | 11.84M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b0 | EfficientNetV1 | 4.05M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv1_b1 | EfficientNetV1 | 6.58M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv1_b2 | EfficientNetV1 | 7.77M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv1_b3 | EfficientNetV1 | 10.79M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetv1_b4 | EfficientNetV1 | 17.68M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b5 | EfficientNetV1 | 28.52M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.6 and depth_coefficient=2.2. |
| efficientnetv1_b6 | EfficientNetV1 | 40.97M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.8 and depth_coefficient=2.6. |
| efficientnetv1_b7 | EfficientNetV1 | 64.11M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=2.0 and depth_coefficient=3.1. |
| efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 77.1% top-1 accuracy and 93.3% top-5 accuracy on ImageNet. |
| efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 79.1% top-1 accuracy and 94.4% top-5 accuracy on ImageNet. |
| efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 80.1% top-1 accuracy and 94.9% top-5 accuracy on ImageNet. |
| efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained ImageNet classification weights. Published weights are capable of scoring 83.9% top-1 accuracy and 96.7% top-5 accuracy on ImageNet. |
| efficientnetv2_s | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. |
| efficientnetv2_m | EfficientNetV2 | 53.15M | EfficientNet architecture with 7 convolutional blocks. |
| efficientnetv2_l | EfficientNetV2 | 117.75M | EfficientNet architecture with 7 convolutional blocks, but with more filters than efficientnetv2_m. |
| efficientnetv2_b0 | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv2_b1 | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv2_b2 | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv2_b3 | EfficientNetV2 | 12.93M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K and scores 69% top-1 accuracy on the validation set. |
| mit_b0 | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b1 | MiT | 13.16M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b2 | MiT | 24.20M | MiT (MixTransformer) model with 16 transformer blocks. |
| mit_b3 | MiT | 44.08M | MiT (MixTransformer) model with 28 transformer blocks. |
| mit_b4 | MiT | 60.85M | MiT (MixTransformer) model with 41 transformer blocks. |
| mit_b5 | MiT | 81.45M | MiT (MixTransformer) model with 52 transformer blocks. |
| mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
| mobilenet_v3_large | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers where the batch normalization and hard-swish activation are applied after the convolution layers. |
| resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on the ImageNet 2012 classification task. |
| resnet18 | ResNetV1 | 11.19M | ResNet model with 18 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet34 | ResNetV1 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50 | ResNetV1 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet101 | ResNetV1 | 42.61M | ResNet model with 101 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet152 | ResNetV1 | 58.30M | ResNet model with 152 layers where the batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on the ImageNet 2012 classification task. |
| resnet18_v2 | ResNetV2 | 11.18M | ResNet model with 18 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet34_v2 | ResNetV2 | 21.30M | ResNet model with 34 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet50_v2 | ResNetV2 | 23.56M | ResNet model with 50 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet101_v2 | ResNetV2 | 42.63M | ResNet model with 101 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet152_v2 | ResNetV2 | 58.33M | ResNet model with 152 layers where the batch normalization and ReLU activation precede the convolution layers (v2 style). |
| videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.6% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 80.6% top-1 and 94.5% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. Pretrained on the ImageNet-1K dataset and trained on the Kinetics 400 dataset. |
| videoswin_tiny | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. |
| videoswin_small | VideoSwinS | 49.51M | A small Video Swin backbone architecture. |
| videoswin_base | VideoSwinB | 87.64M | A base Video Swin backbone architecture. |
| videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-22K dataset and trained on the Kinetics 400 dataset. Published weights are capable of scoring 82.7% top-1 and 95.5% top-5 accuracy on the Kinetics 400 dataset. |
| videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the ImageNet-22K dataset and trained on the Kinetics 600 dataset. Published weights are capable of scoring 84.0% top-1 and 96.5% top-5 accuracy on the Kinetics 600 dataset. |
| videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone architecture. Pretrained on the Kinetics 400 dataset and trained on the Something Something V2 dataset. Published weights are capable of scoring 69.6% top-1 and 92.7% top-5 accuracy on the Something Something V2 dataset. |
| vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_base | VitDet | 89.67M | Detectron2 ViT backbone with 12 transformer encoders, embed dim 768, and attention layers with 12 heads, with global attention on encoders 2, 5, 8, and 11. |
| vitdet_large | VitDet | 308.28M | Detectron2 ViT backbone with 24 transformer encoders, embed dim 1024, and attention layers with 16 heads, with global attention on encoders 5, 11, 17, and 23. |
| vitdet_huge | VitDet | 637.03M | Detectron2 ViT backbone with 32 transformer encoders, embed dim 1280, and attention layers with 16 heads, with global attention on encoders 7, 15, 23, and 31. |
| yolo_v8_xs_backbone | YOLOV8 | 1.28M | An extra small YOLOV8 backbone. |
| yolo_v8_s_backbone | YOLOV8 | 5.09M | A small YOLOV8 backbone. |
| yolo_v8_m_backbone | YOLOV8 | 11.87M | A medium YOLOV8 backbone. |
| yolo_v8_l_backbone | YOLOV8 | 19.83M | A large YOLOV8 backbone. |
| yolo_v8_xl_backbone | YOLOV8 | 30.97M | An extra large YOLOV8 backbone. |
| yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
| yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
| yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
| center_pillar_waymo_open_dataset | Unknown | 1.28M | An example CenterPillar backbone for the Waymo Open Dataset (WOD). |

Task presets

Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired.

The names below can be used with the from_preset() constructor for the corresponding task models.

object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
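
The returned detector can be used for inference immediately. As a minimal sketch, where the random array stands in for a real batch of [0, 255] images:

import numpy as np

images = np.random.uniform(0, 255, size=(1, 512, 512, 3)).astype("float32")
predictions = object_detector.predict(images)
# Decoded detections come back as a bounding-box dictionary
# (boxes, confidence scores, and class ids) in the "xywh" format requested above.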

Note that all backbone presets are also applicable to the tasks. For example, you can directly use a ResNetBackbone preset with RetinaNet. In this case, fine-tuning is necessary, since the task-specific layers will be randomly initialized.

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
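
A short fine-tuning pass then trains the randomly initialized detection head. The sketch below continues from the model above; train_ds is a hypothetical tf.data.Dataset yielding batches of the form {"images": ..., "bounding_boxes": ...}, and the loss and optimizer choices are illustrative rather than prescribed by the preset:

model.compile(
    classification_loss="focal",  # focal loss for the classification head
    box_loss="smoothl1",          # smooth L1 loss for box regression
    optimizer="adam",
)
model.fit(train_ds, epochs=3)  # train_ds is a hypothetical detection dataset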

For brevity, we do not include the backbone presets in the following table.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True, or with pixel intensities rescaled to the range [0, 1] if include_rescaling=False.

{{task_presets_table}}

API Documentation

Tasks

Backbones