CLIP · ViT-L/14 · Foundation Model

Foundation Insights

Paste any inference trace (captions, retrieval rankings, or zero-shot scores) and the model generates an interpretability analysis for it.


GPU: 78%

Loss: 0.184

Throughput: 312/s

Inference context

Generated analysis

Provide context and run to generate a per-inference interpretability report.

Foundation Models

Models pretrained on broad data, typically with self-supervision at scale, and adaptable to many downstream tasks with minimal task-specific supervision.

Multimodal Alignment

The contrastive InfoNCE loss pulls matched image–text embeddings together on the unit sphere and pushes mismatched pairs apart, so retrieval works in either direction: image to text or text to image.
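
As a rough illustration of the alignment objective, the sketch below computes a symmetric InfoNCE loss over a tiny batch in plain Python. The function names, the toy 2-D embeddings, and the fixed temperature of 0.07 are assumptions made for this example (CLIP learns its temperature during training); this is not the model's actual training code.

```python
import math

def l2_normalize(v):
    # Project a vector onto the unit sphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumes row i of image_embs is paired with row i of text_embs.
    The temperature is a fixed illustrative value, not a learned one.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: the correct text for image k sits in column k.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Hypothetical toy batch: two images and two captions in 2-D.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(images, [[1.0, 0.1], [0.1, 1.0]])
swapped = symmetric_info_nce(images, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each image sits closest to its own caption and large when the pairing is swapped, which is the property that makes both retrieval directions work from one shared embedding space.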

Failure Modes

Distribution shift, spurious correlations learned from caption co-occurrence, and prompt sensitivity dominate real-world errors.

Interpretability

Cosine geometry, attention maps, and counterfactual probes help show what the model attended to and which inputs drove a given score.
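
The cosine-geometry part can be sketched in a few lines: score one image embedding against several prompt embeddings, then turn the similarities into zero-shot probabilities with a softmax. The embeddings, prompt strings, and the `logit_scale` of 100.0 below are illustrative assumptions, not values taken from a real trace.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_scores(image_emb, prompt_embs, logit_scale=100.0):
    """Return (cosine similarities, softmax probabilities) per prompt.

    logit_scale is an assumed constant used to sharpen the softmax.
    """
    sims = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    top = max(sims.values())
    # Subtract the max before exponentiating for numerical stability.
    exps = {label: math.exp(logit_scale * (s - top)) for label, s in sims.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return sims, probs

# Hypothetical embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.1],
}
sims, probs = zero_shot_scores(image_emb, prompt_embs)
best = max(probs, key=probs.get)
print(best, round(sims[best], 3))
```

Re-running the same scoring with reworded prompts and comparing the similarity gap between the top two labels is one simple counterfactual probe for prompt sensitivity.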

References

1. Radford et al. — Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021).
2. Schuhmann et al. — LAION-5B: An open large-scale dataset for training next-generation image-text models (NeurIPS, 2022).
3. Dosovitskiy et al. — An Image is Worth 16×16 Words (ViT, 2021).
4. Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (2022).
5. Bommasani et al. — On the Opportunities and Risks of Foundation Models (Stanford CRFM, 2021).