CLIP · ViT-L/14 · Foundation Model

EDA Dashboard

Exploratory analysis of the multimodal corpus driving the CLIP and BLIP-2 pre-training pipeline.

GPU: 78%

Loss: 0.184

Throughput: 312/s

All charts and tables are exportable for research reports.

Dataset overview

Metric	Value	Note
Total image-text pairs	5,850,000,000	LAION-5B
Unique English captions	2,320,000,000	en subset
Multilingual subset	2,260,000,000	100+ languages
Average caption length	11.4 tokens	median 9
Distinct classes (zero-shot)	21,841	ImageNet-21k mapping
Missing alt-text pairs	1.7%	filtered
Duplicate near-images	3.4%	phash dedupe
Resolution ≥ 512px	62.1%	high-res subset
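
The "phash dedupe" note above points to perceptual-hash deduplication. As a minimal sketch of the idea, the code below uses a simpler average hash as a stand-in for a DCT-based pHash and works on tiny grayscale grids instead of decoded images. The function names, the grid representation, and the `max_distance` threshold are all assumptions for illustration, not the dashboard's actual pipeline.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0..255


def average_hash(grid: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(grids: List[Grid], max_distance: int = 1) -> List[int]:
    """Return indices of images kept after greedy near-duplicate removal."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for i, g in enumerate(grids):
        h = average_hash(g)
        # Keep the image only if it is far from every hash kept so far.
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept


# Illustrative 4x4 thumbnails (made-up values).
base = [[10, 200, 10, 200], [200, 10, 200, 10],
        [10, 200, 10, 200], [200, 10, 200, 10]]
brighter = [[p + 5 for p in row] for row in base]    # same pattern, slight exposure shift
inverted = [[255 - p for p in row] for row in base]  # opposite pattern

print(dedupe([base, brighter, inverted]))  # [0, 2]
```

The greedy loop compares each hash against every kept hash, which is quadratic; at billions of pairs a real pipeline would presumably bucket or index the hashes instead.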

Scene type distribution

Nature (28.0%)

Urban (19.0%)

People (17.0%)

Objects (14.0%)

Animals (9.0%)

Food (7.0%)

Other (6.0%)

Caption length distribution

Language distribution

en — English (39.0%)

zh — Chinese (12.0%)

es — Spanish (9.0%)

fr — French (7.0%)

de — German (6.0%)

ja — Japanese (5.0%)

other — Other languages (22.0%)

Most frequent caption tokens

the, a, of, in, with, on, and, image, photo, people, view, city, nature, sea, beach, sky
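
A token ranking like the one above could be produced by counting tokens across captions. The sketch below assumes lowercase alphanumeric tokenization with a regular expression, which is a simplification; the actual pipeline may well count subword tokens from the model's tokenizer. `top_tokens` and the sample captions are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_tokens(captions: Iterable[str], k: int = 5) -> List[Tuple[str, int]]:
    """Count lowercase alphanumeric tokens and return the k most frequent."""
    counts: Counter = Counter()
    for caption in captions:
        counts.update(re.findall(r"[a-z0-9]+", caption.lower()))
    return counts.most_common(k)


# Made-up captions for illustration only.
sample = [
    "A photo of the beach and the sea",
    "View of the city in the evening",
    "The sky over the sea",
]

for token, count in top_tokens(sample, k=3):
    print(token, count)
```

Stop words such as "the" and "of" dominate raw counts, as they do in the list above, so a content-focused view would need a stop-word filter.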

Image resolution distribution

Brightness histogram

Correlation matrix · image stats

Stat	Bright	Contrast	Saturation	Edges	Entropy	Aspect
Bright	0.90	0.74	0.32	-0.15	-0.46	
Contrast	0.74	0.90	0.74	0.32	-0.15	-0.46
Saturation	0.32	0.74	0.90	0.74	0.32	-0.15
Edges	-0.15	0.32	0.74	0.90	0.74	0.32
Entropy	-0.46	-0.15	0.32	0.74	0.90	0.74
Aspect		-0.46	-0.15	0.32	0.74	0.90
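
For reference, a pairwise correlation matrix over per-image statistics can be computed as sketched below. This assumes the dashboard reports Pearson correlation, which the page does not state; the statistic names and values are made up for illustration.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(stats: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Nested dict m[a][b] holding the correlation between statistics a and b."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Made-up per-image statistics for four images (illustrative values only).
stats = {
    "brightness": [0.2, 0.4, 0.6, 0.8],
    "contrast": [0.1, 0.3, 0.5, 0.7],
    "aspect": [1.8, 1.5, 1.2, 0.9],
}

m = correlation_matrix(stats)
print(round(m["brightness"]["contrast"], 2))  # 1.0
print(round(m["brightness"]["aspect"], 2))    # -1.0
```

Under this definition every statistic correlates with itself at exactly 1.0, so the 0.90 diagonal shown in the matrix suggests the dashboard uses a different measure or scaling.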

t-SNE embedding clusters · 320 sampled vectors

Training dynamics · 30 epochs

Series: Train loss, Val loss, Top-1 acc, GPU util %