CLIP · ViT-L/14 · Foundation Model

CLIP Foundation Architecture

Dual-encoder contrastive design: an image transformer and a text transformer whose outputs are projected into a shared 768-D latent space.

GPU: 78%

Loss: 0.184

Throughput: 312/s

Vision Encoder

ViT-L/14: 14×14-pixel patches (a 16×16 patch grid at 224×224 input), 24 layers, 16 heads, 1024 width.
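
The relationship between patch size and sequence length can be checked with a little arithmetic. The sketch below assumes a 224×224 input resolution and a single prepended class token, neither of which is stated on this card; the "/14" in the model name is the patch side in pixels, and `vit_sequence_length` is a hypothetical helper, not part of any library.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 14) -> int:
    """Number of tokens the vision transformer attends over.

    Assumes a square image, non-overlapping square patches,
    and one extra class token (all assumptions for illustration).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 224 // 14 = 16
    return patches_per_side * patches_per_side + 1  # patch grid + class token


print(vit_sequence_length())  # 16 * 16 + 1 = 257
```

Under these assumptions, a larger input resolution with the same patch size lengthens the sequence quadratically in the image side.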

Text Encoder

12-layer GPT-style transformer, 768 width, 12 heads, 77-token context.

Projection

Per-encoder linear projection into a shared 768-D latent space; embeddings are L2-normalized.
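
A minimal sketch of the projection step, in plain Python with toy dimensions. It assumes a bias-free linear map followed by L2 normalization; `project_and_normalize` and `toy_weight` are hypothetical names and values chosen for illustration, and a real model would project to 768 dimensions rather than 3.

```python
import math


def project_and_normalize(features, weight):
    """Map one feature vector into the shared space and scale it to unit L2 norm.

    features: list of d_in floats (encoder output for one image or caption)
    weight:   d_in x d_out matrix as a list of rows (hypothetical, bias-free)
    """
    d_out = len(weight[0])
    # Matrix-vector product: projected[j] = sum_i features[i] * weight[i][j]
    projected = [
        sum(f * row[j] for f, row in zip(features, weight)) for j in range(d_out)
    ]
    norm = math.sqrt(sum(x * x for x in projected))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in projected]


# Toy sizes: a 4-D encoder output projected to a 3-D shared space.
toy_weight = [
    [0.5, -0.1, 0.2],
    [0.0, 0.3, -0.4],
    [0.1, 0.1, 0.1],
    [-0.2, 0.6, 0.0],
]
embedding = project_and_normalize([1.0, 2.0, 3.0, 4.0], toy_weight)
unit_norm = math.sqrt(sum(x * x for x in embedding))  # should be 1 up to rounding
```

Because both image and text embeddings end up on the unit sphere, their dot product is a cosine similarity, which is what the contrastive objective compares.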

Objective

Symmetric InfoNCE with a learned temperature τ over the N×N image-text similarity matrix of each batch.
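
The objective can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: the embeddings are taken to be already L2-normalized, the temperature is passed as a fixed number (0.07 is an arbitrary illustrative value) rather than learned, and `symmetric_info_nce` is a hypothetical function name.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    image_emb, text_emb: N lists of floats each; row i of one is the
    matching pair of row i of the other. Assumes unit-norm rows.
    """
    n = len(image_emb)
    # N x N similarity matrix, scaled by 1 / temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image -> text: softmax over each row, target is the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: softmax over each column, target is the diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch of 3 pairs with orthonormal embeddings.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = symmetric_info_nce(images, images)
# Rotating the captions by one position breaks every pairing.
mismatched_loss = symmetric_info_nce(images, images[1:] + images[:1])
```

In this toy batch the aligned pairing yields a loss near zero while the rotated pairing yields a large one, which is the behavior the symmetric objective is meant to reward.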

Attention map (illustrative)

Layer 18 · head 7 — token "skyline"