CLIP · ViT-L/14 · Foundation Model

Multimodal Foundation Model Overview

CLIP ViT-L/14 trained conceptually on LAION-5B · cross-modal alignment, zero-shot transfer & retrieval at scale.

Foundation model · CLIP · 428M params

Cross-modal understanding, at billion-pair scale.

A shared embedding space in which images and text map to the same geometry, enabling retrieval, zero-shot classification, captioning, and rapid transfer to medical, fashion, agriculture, and wildlife domains.
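As a rough illustration of how a shared space enables zero-shot classification, the sketch below scores one image embedding against a few text-prompt embeddings by cosine similarity and picks the closest prompt. The 3-dimensional vectors, the prompt strings, and the helper names (`l2_normalize`, `cosine`, `zero_shot_classify`) are made up for this example; real CLIP embeddings come from the image and text encoders and have far more dimensions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy, hand-written embeddings (assumption: stand-ins for encoder outputs).
label_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a leaf": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

best, scores = zero_shot_classify(image_emb, label_embs)
print(best)
```

Retrieval works the same way in reverse: embed a text query once and rank stored image embeddings by the same similarity score.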

5.85B image-text pairs · 428M ViT-L/14 params · 76.2% ImageNet zero-shot
Live training (streaming)
Contrastive loss: 0.184 · step 184k
Throughput: 312/s · GPU util: 78% · ETA: 04:12:48
Top-1 Accuracy: 76.2% (1.40% vs last epoch)
F1 Score: 0.831 (0.60% vs last epoch)
Embeddings Indexed: 48.2M (3.10% vs last epoch)
Inference Latency: 38ms (2.40% vs last epoch)
Capability profile: ViT-L/14 vs ViT-B/32
Active pipeline: Tokenize → Encode → Project → Contrastive → Index
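The Contrastive stage of the pipeline can be sketched as a symmetric cross-entropy over the image-text similarity matrix, which is how CLIP-style training is usually described. This is a minimal pure-Python sketch, not the pipeline's actual code: the function names are hypothetical, the inputs stand in for already projected embeddings, and the fixed `temperature=0.07` is an assumption (CLIP treats the temperature as a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair k (image k, text k) is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of two pairs (assumption: hand-written stand-ins for projected embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_contrastive_loss(images, matched_texts)
mismatched_loss = clip_contrastive_loss(images, swapped_texts)
print(aligned_loss < mismatched_loss)
```

The loss is small when each image is most similar to its own caption and large when the pairs are swapped, which is the signal that pulls matching pairs together before the embeddings are indexed.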