CLIP · ViT-L/14 · Foundation Model

CLIP Foundation Architecture

Dual-encoder contrastive design: an image transformer and a text transformer whose outputs are projected into a shared 768-D latent space.

[Diagram: Image (224×224) → Vision Transformer (ViT-L/14, 304M params) → Image Projection; text "a futuristic city" → Text Transformer (63M, BPE 49k) → Text Projection; both projections feed the Shared 768-D Embedding (InfoNCE contrastive alignment).]
Vision Encoder
ViT-L/14 splits the 224×224 input into 14×14-pixel patches (a 16×16 grid, 256 patches); 24 layers, 16 heads, 1024 width.
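The patch arithmetic can be checked with a short sketch. The "/14" in ViT-L/14 is the patch side in pixels, so a 224×224 image yields a 16×16 grid of patches. The function name vit_sequence_length is hypothetical, and the extra class token is an assumption about the encoder's input layout rather than something stated above.

```python
def vit_sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a ViT sees for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    # Assumption: one learned class token is prepended to the patch tokens.
    return grid * grid + (1 if class_token else 0)


grid = 224 // 14
print(grid, vit_sequence_length(224, 14))  # prints: 16 257
```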
Text Encoder
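12-layer GPT-style (causal) transformer with a 77-token context. The ViT-L/14 configuration uses 768 width and 12 heads; the 512-width, 8-head, 63M-parameter variant is the base text encoder paired with the smaller CLIP models.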
Projection
Each encoder's output passes through its own linear projection into the shared 768-D latent space, where embeddings are L2-normalized.
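A minimal sketch of the projection step, in plain Python so it runs without dependencies. The dimensions are toy values (8 and 6 in, 4 out) standing in for the real encoder widths and the 768-D shared space. The helper names project and l2_normalize are hypothetical, the weights are random rather than trained, and the bias-free linear map is an assumption.

```python
import math
import random


def project(x, weight):
    """Bias-free linear map: x (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]


def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


random.seed(0)
D_IMAGE, D_TEXT, D_SHARED = 8, 6, 4  # toy stand-ins for the real widths

# Random placeholder weights; a trained model would learn these.
w_img = [[random.gauss(0, 1) for _ in range(D_SHARED)] for _ in range(D_IMAGE)]
w_txt = [[random.gauss(0, 1) for _ in range(D_SHARED)] for _ in range(D_TEXT)]

img_feat = [random.gauss(0, 1) for _ in range(D_IMAGE)]  # pooled image feature
txt_feat = [random.gauss(0, 1) for _ in range(D_TEXT)]   # pooled text feature

img_emb = l2_normalize(project(img_feat, w_img))
txt_emb = l2_normalize(project(txt_feat, w_txt))

# With unit-length embeddings, the dot product is the cosine similarity.
cosine = sum(a * b for a, b in zip(img_emb, txt_emb))
print(len(img_emb), len(txt_emb), cosine)
```

Because both embeddings have unit length, a plain dot product in the shared space already gives cosine similarity, which is what the contrastive objective compares.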
Objective
Symmetric InfoNCE with a learned temperature τ over the N×N image-text similarity matrix of a batch of N pairs.
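The objective can be sketched in plain Python as follows. It assumes that row i of the image embeddings and row i of the text embeddings form the matching pair, so the diagonal of the N×N similarity matrix holds the positives. The temperature is a fixed float here (0.07 is used only as an illustrative value) rather than a learned parameter, and the function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    image_emb, text_emb: N L2-normalized vectors; pair i is the positive.
    """
    n = len(image_emb)
    # N x N matrix of cosine similarities scaled by 1 / temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: same over the columns of the matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every pair mismatched

loss_matched = symmetric_info_nce(aligned, aligned)
loss_mismatched = symmetric_info_nce(aligned, shuffled)
print(loss_matched < loss_mismatched)  # prints: True
```

A lower temperature sharpens the softmax, so mismatched pairs are penalized more heavily; when every similarity is equal, the loss reduces to log N.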
Attention map (illustrative)
Layer 18 · head 7 — token "skyline"