CLIP · ViT-L/14 · Foundation Model
CLIP Foundation Architecture
Dual-encoder contrastive design: an image transformer and a text transformer whose outputs are projected into a shared 768-D latent space.
GPU: 78%
Loss: 0.184
Throughput: 312/s
Vision Encoder
ViT-L/14: 14×14-pixel patches (a 16×16 patch grid at 224×224 input), 24 layers, 16 heads, 1024 width.
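The patch arithmetic can be checked with a short sketch. It assumes a 224×224 input resolution (not stated on this page), a 14-pixel patch edge as the "/14" in the model name indicates, and one extra class token prepended to the sequence. The function name `vit_sequence_length` is hypothetical.

```python
def vit_sequence_length(image_size=224, patch_size=14):
    """Return (patches_per_side, total_tokens) for a square input image.

    Assumes non-overlapping square patches plus one class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    # +1 accounts for the class token prepended to the patch sequence
    return per_side, per_side * per_side + 1


grid, tokens = vit_sequence_length()
print(grid, tokens)  # 16 257
```

Under these assumptions the vision transformer attends over 257 tokens per image: 256 patches plus the class token.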
Text Encoder
12-layer GPT-style transformer, 768 width, 12 heads, 77 tokens.
Projection
Linear projection to a 768-D L2-normalized shared latent space.
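A minimal NumPy sketch of the projection step, shown for the image side only. The weights here are random placeholders rather than trained parameters, and `project_and_normalize` is a hypothetical helper name. The text side would presumably work the same way with its own weight matrix.

```python
import numpy as np


def project_and_normalize(features, weight):
    """Linearly project encoder features, then L2-normalize each row."""
    z = features @ weight  # (batch, 768)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 1024))       # 1024 = vision encoder width
w_image = rng.normal(size=(1024, 768)) * 0.02     # placeholder projection weights
z_img = project_and_normalize(image_features, w_image)
print(z_img.shape)  # (4, 768)
```

Because every row has unit length, a dot product between two embeddings equals their cosine similarity.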
Objective
Symmetric InfoNCE with a learned temperature τ over an N×N similarity matrix.
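The objective can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the training code behind this page: matched image-text pairs are assumed to share a row index, and τ is parameterized through its logarithm so it stays positive (one common choice, assumed here). The function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_img, z_txt, log_temperature):
    """Symmetric InfoNCE over an N x N similarity matrix.

    z_img, z_txt: (N, D) L2-normalized embeddings; row i of each is a matched pair.
    log_temperature: log(tau), assumed to be the learned scalar.
    """
    logits = (z_img @ z_txt.T) / np.exp(log_temperature)  # (N, N)
    idx = np.arange(logits.shape[0])
    # Image-to-text: softmax over each row, target is the diagonal entry
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: softmax over each column, same diagonal targets
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Two sanity checks follow from the definition: perfectly separated pairs with a small τ give a loss near zero, and embeddings that are all identical give exactly log N, since every candidate is equally likely.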
Attention map (illustrative)
Layer 18 · head 7 — token "skyline"