CLIP · ViT-L/14 · Foundation Model

Foundation Insights

Paste any inference trace (captions, retrieval rankings, or zero-shot scores) and the model generates an interpretability analysis of that inference.

Foundation Models

Self-supervised models pretrained on broad data, adaptable to many downstream tasks with minimal supervision.

Multimodal Alignment

The contrastive InfoNCE loss pulls paired image and text embeddings toward nearby points on the unit sphere and pushes mismatched pairs apart, which enables retrieval in either direction (image-to-text or text-to-image).
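
As a rough illustration of this objective, the sketch below computes a symmetric InfoNCE loss over a small batch using only the Python standard library. The function names, the toy vectors, and the fixed temperature of 0.07 are assumptions made for the example (CLIP learns its temperature during training), so this is a minimal sketch and not the model's training code.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(image_embs, text_embs, temperature):
    """Pairwise cosine similarities divided by the temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms.

    temperature=0.07 is an illustrative fixed value (assumption).
    """
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy batch: pair k of images matches pair k of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.1], [0.1, 1.0]]
texts_swapped = [[0.1, 1.0], [1.0, 0.1]]

loss_aligned = symmetric_info_nce(images, texts_aligned)
loss_swapped = symmetric_info_nce(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled. Averaging the row-wise and column-wise terms is what makes retrieval work in both directions.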

Failure Modes

Distribution shift, spurious correlations learned from caption co-occurrence, and prompt sensitivity dominate real-world errors.
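
Prompt sensitivity can be made concrete by scoring the same image under several prompt templates and looking at how far the target-class probability moves. The sketch below assumes the embeddings have already been computed. The toy vectors, the function names, and the temperature of 0.1 are hypothetical values chosen for illustration, not outputs of a real model.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_probs(image_emb, class_text_embs, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities, one per class."""
    logits = [cosine(image_emb, t) / temperature for t in class_text_embs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def prompt_sensitivity(image_emb, embs_by_template, target_class):
    """Range (max minus min) of the target-class probability across templates."""
    probs = [zero_shot_probs(image_emb, class_embs)[target_class]
             for class_embs in embs_by_template]
    return max(probs) - min(probs)

def ensemble_text_embs(embs_by_template):
    """Average the normalized per-class embeddings over templates."""
    n_classes = len(embs_by_template[0])
    ensembled = []
    for c in range(n_classes):
        vecs = [normalize(t[c]) for t in embs_by_template]
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        ensembled.append(normalize(mean))
    return ensembled

# Hypothetical embeddings: two classes, two prompt templates.
image = [0.9, 0.1]
template_a = [[1.0, 0.0], [0.0, 1.0]]
template_b = [[0.3, 1.0], [0.0, 1.0]]

spread = prompt_sensitivity(image, [template_a, template_b], target_class=0)
ensemble_probs = zero_shot_probs(image, ensemble_text_embs([template_a, template_b]))
print(spread, ensemble_probs)
```

A large spread suggests that the prediction depends on wording and not only on image content. Averaging text embeddings over templates is one common way to damp that dependence.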

Interpretability

Cosine-similarity geometry, attention maps, and counterfactual probes indicate what the model attended to and help explain why.
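
One simple form of counterfactual probe is to edit the input, re-embed it, and measure how much the cosine similarity to the text changes. The sketch below assumes the embeddings of the original and edited images are already available. The edit names and vectors are hypothetical placeholders, and `counterfactual_effects` is an illustrative helper, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def counterfactual_effects(text_emb, original_emb, edited_embs):
    """Return (edit_name, similarity_change) pairs, largest drop first."""
    base = cosine(original_emb, text_emb)
    effects = [(name, cosine(emb, text_emb) - base)
               for name, emb in edited_embs.items()]
    return sorted(effects, key=lambda pair: pair[1])

# Hypothetical embeddings for one caption and two edited versions of an image.
text = [1.0, 0.0, 0.0]
original = [0.9, 0.3, 0.1]
edits = {
    "mask_object": [0.2, 0.8, 0.1],
    "mask_background": [0.9, 0.1, 0.1],
}

effects = counterfactual_effects(text, original, edits)
for name, delta in effects:
    print(name, round(delta, 3))
```

Edits that cause the largest drop in similarity point to the regions or attributes the match relied on most. An edit that leaves the score unchanged suggests that part of the input contributed little.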

References
  1. Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021).
  2. Schuhmann et al., LAION-5B: An open large-scale dataset for training next-generation image-text models (NeurIPS, 2022).
  3. Dosovitskiy et al., An Image is Worth 16×16 Words (ViT, 2021).
  4. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2022).
  5. Bommasani et al., On the Opportunities and Risks of Foundation Models (Stanford CRFM, 2021).