CLIP · ViT-L/14 · Foundation Model
Multimodal Search Engine
Real text embeddings × real image-caption embeddings — every cell of the matrix is a true cosine similarity in shared 768-D space.
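As a rough illustration of what one cell of that matrix means, here is a minimal sketch of an N×M cosine similarity computation. It assumes the embeddings are already available as plain lists of floats. The toy 3-D vectors stand in for the 768-D embeddings, and the function names `l2_normalize` and `cosine_similarity_matrix` are hypothetical, not taken from the application.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity_matrix(text_embs, image_embs):
    """Return an N x M matrix: rows are text queries, columns are image captions."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [
        [sum(a * b for a, b in zip(t, i)) for i in images]
        for t in texts
    ]


# Toy 3-D vectors (assumed data); real embeddings would have 768 dimensions.
text_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_embs = [[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 5.0]]

sim = cosine_similarity_matrix(text_embs, image_embs)
```

Normalizing both sides first means vector length has no effect on the score. Only direction in the shared space matters.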
GPU: 78%
Loss: 0.184
Throughput: 312/s
Execution pipeline
- 1. Tokenize prompts
- 2. Encode text queries
- 3. Encode image captions
- 4. Cosine similarity (N×M)
- 5. Top-K retrieval & ranking
- 6. Diversity re-rank
- 7. Insight generation
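Steps 5 and 6 of the pipeline above could be sketched as follows for a single query. The page does not say which diversity method it uses, so this example swaps in maximal marginal relevance (MMR) as an assumed stand-in. The names `top_k` and `mmr_rerank`, the `lam` trade-off parameter, and all scores are hypothetical illustration values.

```python
def top_k(scores, k):
    """Return indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def mmr_rerank(scores, item_sim, candidates, k, lam=0.7):
    """Greedy MMR: trade query relevance against similarity to items already picked.

    scores[j]      -- query-to-item similarity (one row of the N x M matrix)
    item_sim[a][b] -- item-to-item similarity, used as the redundancy penalty
    lam            -- 1.0 is pure relevance, 0.0 is pure diversity
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(j):
            redundancy = max((item_sim[j][s] for s in selected), default=0.0)
            return lam * scores[j] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


# Assumed toy data: items 0 and 1 are near-duplicates of each other.
scores = [0.90, 0.88, 0.60, 0.30]
item_sim = [
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.15, 0.05],
    [0.10, 0.15, 1.00, 0.20],
    [0.05, 0.05, 0.20, 1.00],
]

candidates = top_k(scores, 3)                                   # [0, 1, 2]
reranked = mmr_rerank(scores, item_sim, candidates, k=2, lam=0.5)  # [0, 2]
```

In this toy case plain top-2 would return the two near-duplicates, while the re-rank keeps the best match and replaces the second with a less redundant item.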
Query phrases
Top-K
Run retrieval to compute a real text↔image similarity matrix.