CLIP · ViT-L/14 · Foundation Model
Model Comparison
Benchmark vision foundation models across retrieval, captioning, embedding quality, latency and memory.
GPU 78% · Loss 0.184 · Throughput 312/s
Charts: Zero-shot accuracy (ImageNet-1k) · Captioning quality (COCO) · Latency vs memory · SOTA progression 2018 → 2023
Full benchmark table
| Model | Params | Embed dim | ImageNet Top-1 | ImageNet Top-5 | COCO BLEU-4 | COCO CIDEr | Inference (ms) | VRAM (GB) | Embedding quality |
|---|---|---|---|---|---|---|---|---|---|
| CLIP ViT-B/32 | 151M | 512 | 63.2% | 88.7% | 24.5 | 67.1 | 11 | 0.6 | 0.78 |
| CLIP ViT-L/14 | 428M | 768 | 75.5% | 94.4% | 28.2 | 76.3 | 28 | 1.7 | 0.87 |
| BLIP | 224M | 768 | 70.1% | 91.3% | 35.4 | 121.5 | 32 | 1.1 | 0.83 |
| BLIP-2 | 1.2B | 768 | 78.6% | 95.9% | 40.8 | 145.8 | 78 | 4.3 | 0.92 |
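
One way to read the latency and memory trade-off from the table above is to check which models are Pareto-efficient, meaning no other model is at least as accurate while costing no more on every cost axis. The sketch below is only an illustration: the `Row` class, `dominates`, and `pareto_front` are hypothetical helper names, the values are copied from the table, and treating ImageNet Top-1 as the single quality measure is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    top1: float        # ImageNet Top-1 (%), from the table
    latency_ms: float  # Inference (ms), from the table
    vram_gb: float     # VRAM (GB), from the table


# Values copied from the benchmark table.
ROWS = [
    Row("CLIP ViT-B/32", 63.2, 11, 0.6),
    Row("CLIP ViT-L/14", 75.5, 28, 1.7),
    Row("BLIP", 70.1, 32, 1.1),
    Row("BLIP-2", 78.6, 78, 4.3),
]


def dominates(a, b, costs):
    """True if a is no worse than b on accuracy and every cost, and strictly better on one."""
    no_worse = a.top1 >= b.top1 and all(
        getattr(a, c) <= getattr(b, c) for c in costs
    )
    better = a.top1 > b.top1 or any(
        getattr(a, c) < getattr(b, c) for c in costs
    )
    return no_worse and better


def pareto_front(rows, costs=("latency_ms",)):
    """Return the names of models that no other model dominates, in input order."""
    return [
        r.model
        for r in rows
        if not any(dominates(other, r, costs) for other in rows)
    ]


if __name__ == "__main__":
    print(pareto_front(ROWS, costs=("latency_ms",)))
    print(pareto_front(ROWS, costs=("latency_ms", "vram_gb")))
```

With latency as the only cost, BLIP drops off the front because CLIP ViT-L/14 is both faster and more accurate. Once VRAM is also counted as a cost, BLIP's smaller memory footprint keeps it on the front alongside the other three models.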