CLIP · ViT-L/14 · Foundation Model

EDA Dashboard

Exploratory analysis of the multimodal corpus driving the CLIP and BLIP-2 pre-training pipeline.

All charts and tables are exportable for research reports.

Dataset overview
Metric                          Value            Note
Total image-text pairs          5,850,000,000    LAION-5B
Unique English captions         2,320,000,000    en subset
Multilingual subset             2,260,000,000    100+ languages
Average caption length          11.4 tokens      median 9
Distinct classes (zero-shot)    21,841           ImageNet-21k mapping
Missing alt-text pairs          1.7%             filtered
Duplicate near-images           3.4%             phash dedupe
Resolution ≥ 512px              62.1%            high-res subset
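
The percentage rows in the overview can be turned into approximate absolute counts. The sketch below assumes each percentage is a share of the total image-text pairs; the dashboard does not state the denominator, so treat the results as rough estimates. The dictionary keys and the `approx_count` helper are illustrative names, not part of the pipeline.

```python
from decimal import Decimal

TOTAL_PAIRS = 5_850_000_000  # "Total image-text pairs" row

# Percentages copied from the overview table.
# ASSUMPTION: each one is a share of TOTAL_PAIRS.
SHARES = {
    "missing_alt_text": Decimal("1.7"),
    "near_duplicate_images": Decimal("3.4"),
    "resolution_ge_512px": Decimal("62.1"),
}


def approx_count(percent: Decimal, total: int = TOTAL_PAIRS) -> int:
    """Convert a percentage share into an approximate absolute count."""
    return int(total * percent / 100)


counts = {name: approx_count(pct) for name, pct in SHARES.items()}

for name, n in counts.items():
    print(f"{name}: ~{n:,}")
```

Decimal arithmetic is used here so the percentages are not subject to binary floating-point rounding.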
Scene type distribution
Caption length distribution
Language distribution
Most frequent caption tokens
the, a, of, in, with, on, and, image, photo, people, view, city, nature, sea, beach, sky
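
As a rough illustration of how token-frequency and caption-length summaries like these might be produced, here is a minimal sketch. The captions are hypothetical examples, and the whitespace `tokenize` function is a stand-in; the dashboard does not say which tokenizer the pipeline uses.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical captions, for illustration only (not taken from the corpus).
captions = [
    "a photo of the beach and the sea",
    "people in the city",
    "view of the sky with a photo of nature",
]


def tokenize(caption: str) -> list[str]:
    """Lowercase and split on whitespace; a stand-in for the real tokenizer."""
    return caption.lower().split()


token_lists = [tokenize(c) for c in captions]

# Frequency of every token across all captions.
token_counts = Counter(tok for toks in token_lists for tok in toks)

# Caption length in tokens, one entry per caption.
lengths = [len(toks) for toks in token_lists]

print(token_counts.most_common(3))
print("mean length:", mean(lengths), "median length:", median(lengths))
```

With a subword tokenizer the counts and lengths would differ, so figures such as the 11.4-token average depend on that choice.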
Image resolution distribution
Brightness histogram
Correlation matrix · image stats
              Bright  Contrast  Saturation   Edges  Entropy  Aspect
Bright          0.90      0.74        0.32   -0.15    -0.46   -0.46
Contrast        0.74      0.90        0.74    0.32    -0.15   -0.46
Saturation      0.32      0.74        0.90    0.74     0.32   -0.15
Edges          -0.15      0.32        0.74    0.90     0.74    0.32
Entropy        -0.46     -0.15        0.32    0.74     0.90    0.74
Aspect         -0.46     -0.46       -0.15    0.32     0.74    0.90
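
A correlation matrix over per-image statistics could be computed along the following lines. This is a sketch under assumptions: the `stats` values are hypothetical, only three of the six statistics are shown, and Pearson correlation is assumed because the dashboard does not name the coefficient it uses.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical per-image statistics (one value per image), illustration only.
stats = {
    "Bright": [0.2, 0.4, 0.6, 0.8],
    "Contrast": [0.1, 0.3, 0.4, 0.9],
    "Aspect": [1.8, 1.5, 1.3, 1.0],
}

names = list(stats)
matrix = [[pearson(stats[a], stats[b]) for b in names] for a in names]

# Print the matrix with a header row and one labeled row per statistic.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<10}" + "".join(f"{value:>10.2f}" for value in row))
```

A Pearson matrix computed this way has 1.00 on its diagonal, whereas the dashboard shows 0.90 there; the source does not explain how its values were derived, so the two may not be directly comparable.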
t-SNE embedding clusters · 320 sampled vectors
Training dynamics · 30 epochs
Series: Train loss, Val loss, Top-1 acc, GPU util %