CLIP · ViT-L/14 · Foundation Model
Foundation Insights
Paste any inference trace (captions, retrieval rankings, zero-shot scores) and the model produces a per-inference interpretability analysis.
GPU: 78%
Loss: 0.184
Throughput: 312/s
Inference context
Generated analysis
Provide context and run to generate a per-inference interpretability report.
Foundation Models
Models pretrained on broad data, typically with self-supervision at scale, and adaptable to many downstream tasks with minimal task-specific supervision.
Multimodal Alignment
A contrastive InfoNCE loss pulls paired image–text embeddings toward nearby points on the unit sphere and pushes mismatched pairs apart, enabling retrieval in either direction (image-to-text or text-to-image).
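As a rough illustration of the loss described above, here is a minimal pure-Python sketch of a symmetric InfoNCE objective over a small batch. The function names, the toy 2-D vectors, and the temperature of 0.07 are assumptions chosen for the example; a real model learns its temperature and runs on batched tensors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_vecs, text_vecs, temperature=0.07):
    """Mean of image-to-text and text-to-image cross-entropy over a batch.

    Row i of image_vecs is assumed to be paired with row i of text_vecs.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Transpose so each row holds one text scored against every image.
    logits_t = [list(col) for col in zip(*logits)]
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: matched pairs point the same way, shuffled pairs do not.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_info_nce(images, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is close to zero when each image is most similar to its own caption and grows when the pairing is shuffled, which is the pressure that aligns the two modalities.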
Failure Modes
Distribution shift, spurious correlation with caption co-occurrence, and prompt sensitivity dominate real-world errors.
Interpretability
Cosine-similarity geometry, attention maps, and counterfactual probes help reveal what the model attended to and why.
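To make the cosine-geometry and counterfactual ideas concrete, the sketch below scores one image embedding against two prompt embeddings and then measures how the scores move after a counterfactual edit. All vectors, prompt strings, helper names, and the logit scale of 100 are hypothetical placeholders; real embeddings would come from the model's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each prompt."""
    logits = [logit_scale * cosine(image_vec, t) for t in text_vecs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def counterfactual_shift(original_vec, edited_vec, text_vecs):
    """Per-prompt change in score after a counterfactual edit to the input."""
    before = zero_shot_scores(original_vec, text_vecs)
    after = zero_shot_scores(edited_vec, text_vecs)
    return [a - b for a, b in zip(after, before)]

# Hypothetical 3-D embeddings standing in for encoder outputs.
prompts = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
text_vecs = list(prompts.values())
image = [0.8, 0.2, 0.1]    # original image embedding
edited = [0.2, 0.8, 0.1]   # embedding after a counterfactual edit

scores = zero_shot_scores(image, text_vecs)
shift = counterfactual_shift(image, edited, text_vecs)
```

A large shift toward one prompt after a small, targeted edit suggests the edited region or attribute was driving the original score.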
References
1. Radford et al. Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021).
2. Schuhmann et al. LAION-5B: An open large-scale dataset for training next-generation image-text models (NeurIPS, 2022).
3. Dosovitskiy et al. An Image is Worth 16×16 Words (ViT, 2021).
4. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models (2022).
5. Bommasani et al. On the Opportunities and Risks of Foundation Models (Stanford CRFM, 2021).