CLIP · ViT-L/14 · Foundation Model

Model Comparison

Benchmark vision foundation models across retrieval, captioning, embedding quality, latency and memory.

Zero-shot accuracy (ImageNet-1k)
Captioning quality (COCO)
Latency vs memory
SOTA progression 2018 → 2023
Full benchmark table
Metric                              Value   Note
CLIP ViT-B/32 · Params              151M
CLIP ViT-B/32 · Embed dim           512
CLIP ViT-B/32 · ImageNet Top-1      63.2%
CLIP ViT-B/32 · ImageNet Top-5      88.7%
CLIP ViT-B/32 · COCO BLEU-4         24.5
CLIP ViT-B/32 · COCO CIDEr          67.1
CLIP ViT-B/32 · Inference (ms)      11
CLIP ViT-B/32 · VRAM (GB)           0.6
CLIP ViT-B/32 · Embedding quality   0.78
CLIP ViT-L/14 · Params              428M
CLIP ViT-L/14 · Embed dim           768
CLIP ViT-L/14 · ImageNet Top-1      75.5%
CLIP ViT-L/14 · ImageNet Top-5      94.4%
CLIP ViT-L/14 · COCO BLEU-4         28.2
CLIP ViT-L/14 · COCO CIDEr          76.3
CLIP ViT-L/14 · Inference (ms)      28
CLIP ViT-L/14 · VRAM (GB)           1.7
CLIP ViT-L/14 · Embedding quality   0.87
BLIP · Params                       224M
BLIP · Embed dim                    768
BLIP · ImageNet Top-1               70.1%
BLIP · ImageNet Top-5               91.3%
BLIP · COCO BLEU-4                  35.4
BLIP · COCO CIDEr                   121.5
BLIP · Inference (ms)               32
BLIP · VRAM (GB)                    1.1
BLIP · Embedding quality            0.83
BLIP-2 · Params                     1.2B
BLIP-2 · Embed dim                  768
BLIP-2 · ImageNet Top-1             78.6%
BLIP-2 · ImageNet Top-5             95.9%
BLIP-2 · COCO BLEU-4                40.8
BLIP-2 · COCO CIDEr                 145.8
BLIP-2 · Inference (ms)             78
BLIP-2 · VRAM (GB)                  4.3
BLIP-2 · Embedding quality          0.92
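
One way to read the accuracy, latency and memory figures together is to ask which models are Pareto-optimal, meaning no other model is at least as good on every axis and strictly better on one. The sketch below is illustrative only: the numbers are copied from the table above, while the names ModelStats, objectives, dominates and pareto_front are hypothetical helpers introduced here, not part of the benchmark. The table does not state batch size or hardware, so the latency values are taken at face value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    top1: float        # ImageNet top-1 accuracy in percent (from the table)
    latency_ms: float  # inference latency in ms (from the table)
    vram_gb: float     # VRAM in GB (from the table)


# Values copied from the benchmark table above.
MODELS = [
    ModelStats("CLIP ViT-B/32", 63.2, 11, 0.6),
    ModelStats("CLIP ViT-L/14", 75.5, 28, 1.7),
    ModelStats("BLIP", 70.1, 32, 1.1),
    ModelStats("BLIP-2", 78.6, 78, 4.3),
]


def objectives(m: ModelStats, include_vram: bool) -> tuple:
    # Every entry is "higher is better", so the cost axes are negated.
    base = (m.top1, -m.latency_ms)
    return base + (-m.vram_gb,) if include_vram else base


def dominates(a: ModelStats, b: ModelStats, include_vram: bool = False) -> bool:
    # a dominates b if it is no worse on every axis and strictly better on one.
    oa, ob = objectives(a, include_vram), objectives(b, include_vram)
    no_worse = all(x >= y for x, y in zip(oa, ob))
    strictly_better = any(x > y for x, y in zip(oa, ob))
    return no_worse and strictly_better


def pareto_front(models, include_vram: bool = False):
    # Keep every model that no other model dominates.
    return [
        m for m in models
        if not any(dominates(other, m, include_vram) for other in models)
    ]


if __name__ == "__main__":
    for m in pareto_front(MODELS):
        print(f"{m.name}: {m.top1}% top-1 at {m.latency_ms} ms")
```

On accuracy and latency alone, BLIP drops off the front because CLIP ViT-L/14 is both more accurate and faster. Passing include_vram=True keeps all four models, since BLIP uses less VRAM than CLIP ViT-L/14. This comparison covers classification accuracy only; on the COCO captioning metrics the BLIP models score higher than both CLIP variants.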