CLIP · ViT-L/14 · Foundation Model
EDA Dashboard
Exploratory analysis of the multimodal corpus driving the CLIP and BLIP-2 pre-training pipeline.
GPU: 78%
Loss: 0.184
Throughput: 312/s
All charts and tables are exportable for research reports.
Dataset overview
| Metric | Value | Note |
|---|---|---|
| Total image-text pairs | 5,850,000,000 | LAION-5B |
| Unique English captions | 2,320,000,000 | en subset |
| Multilingual subset | 2,260,000,000 | 100+ languages |
| Average caption length | 11.4 tokens | median 9 |
| Distinct classes (zero-shot) | 21,841 | ImageNet-21k mapping |
| Pairs missing alt-text | 1.7% | filtered |
| Near-duplicate images | 3.4% | phash dedupe |
| Resolution ≥ 512px | 62.1% | high-res subset |
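The percentage rows are easier to reason about as absolute counts. The sketch below converts them, assuming each percentage is a share of the 5,850,000,000 total pairs; the table does not state the denominator, so treat that as an assumption. The dictionary keys and the `share_to_count` helper are illustrative names, not part of the dashboard.

```python
# Figures copied from the overview table. Reading each percentage as a
# share of ALL pairs is an assumption; the table does not state the base.
TOTAL_PAIRS = 5_850_000_000

SHARES_PERCENT = {
    "missing_alt_text": 1.7,
    "near_duplicate_images": 3.4,
    "resolution_ge_512px": 62.1,
}


def share_to_count(total: int, percent: float) -> int:
    """Convert a percentage share of `total` into a rounded absolute count."""
    return round(total * percent / 100)


approx_counts = {
    name: share_to_count(TOTAL_PAIRS, pct) for name, pct in SHARES_PERCENT.items()
}

for name, count in approx_counts.items():
    print(f"{name}: ~{count:,}")
```

Under that assumption, the near-duplicate share alone corresponds to roughly 199 million pairs, so dedupe settings affect the corpus size noticeably.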
Scene type distribution
Caption length distribution
Language distribution
Most frequent caption tokens
the, a, of, in, with, on, and, image, photo, people, view, city, nature, sea, beach, sky
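A ranking like this can be reproduced on a caption sample with a simple frequency count. The following is a minimal sketch, assuming lowercase word-level tokens split by a regular expression; the dashboard may instead count tokens from the model's own tokenizer, which would give different results. The `top_tokens` function and the sample captions are hypothetical.

```python
import re
from collections import Counter


def top_tokens(captions, k=16):
    """Return the k most frequent lowercase word tokens across captions."""
    counts = Counter()
    for caption in captions:
        # Assumption: a token is a run of ASCII letters or digits.
        counts.update(re.findall(r"[a-z0-9]+", caption.lower()))
    return counts.most_common(k)


# Hypothetical captions, for illustration only.
sample_captions = [
    "A photo of people on the beach",
    "View of the city and the sea",
    "Image of nature with the sky",
]

ranking = top_tokens(sample_captions, k=5)
for token, freq in ranking:
    print(f"{token}: {freq}")
```

As in the list above, stop words such as "the" and "of" dominate the top ranks; filtering them is a common next step when the goal is to inspect content words.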
Image resolution distribution
Brightness histogram
Correlation matrix · image stats
| | Bright | Contrast | Saturation | Edges | Entropy | Aspect |
|---|---|---|---|---|---|---|
| Bright | 0.90 | 0.74 | 0.32 | -0.15 | -0.46 | -0.46 |
| Contrast | 0.74 | 0.90 | 0.74 | 0.32 | -0.15 | -0.46 |
| Saturation | 0.32 | 0.74 | 0.90 | 0.74 | 0.32 | -0.15 |
| Edges | -0.15 | 0.32 | 0.74 | 0.90 | 0.74 | 0.32 |
| Entropy | -0.46 | -0.15 | 0.32 | 0.74 | 0.90 | 0.74 |
| Aspect | -0.46 | -0.46 | -0.15 | 0.32 | 0.74 | 0.90 |
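For reference, a matrix of this shape can be computed from per-image statistics with pairwise Pearson correlation. The sketch below is an illustration under assumptions: the dashboard does not say which correlation measure it uses, the `pearson` and `correlation_matrix` helpers are hypothetical names, and the input values are synthetic random numbers rather than real image statistics.

```python
import math
import random

FEATURES = ["Bright", "Contrast", "Saturation", "Edges", "Entropy", "Aspect"]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a square matrix of pairwise correlations, in key order."""
    names = list(columns)
    return [[pearson(columns[r], columns[c]) for c in names] for r in names]


# Synthetic stand-in data: 200 fake "images" per feature, NOT real stats.
random.seed(0)
columns = {name: [random.gauss(0, 1) for _ in range(200)] for name in FEATURES}

corr = correlation_matrix(columns)
for name, row in zip(FEATURES, corr):
    print(f"{name:<11}" + " ".join(f"{value:6.2f}" for value in row))
```

A Pearson matrix computed this way is symmetric with ones on the diagonal. The dashboard shows 0.90 on the diagonal, so its values may be scaled or computed differently; the source does not say which.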
t-SNE embedding clusters · 320 sampled vectors
Training dynamics · 30 epochs
Train loss · Val loss · Top-1 acc · GPU util %