CLIP · ViT-L/14 · Foundation Model

Multimodal Search Engine

Real text embeddings × real image-caption embeddings: every cell of the matrix is a true cosine similarity in the shared 768-D space.
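As a rough illustration of what one cell of that matrix holds, the sketch below builds an N×M cosine similarity matrix in plain Python. The function names (`l2_normalize`, `cosine_similarity_matrix`) and the toy 3-D vectors are assumptions made for this example. They stand in for the real 768-D embeddings and are not taken from the app's own code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity_matrix(text_embs, image_embs):
    """Return an N x M matrix: rows are text queries, columns are image captions."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [
        [sum(a * b for a, b in zip(t, i)) for i in images]
        for t in texts
    ]

# Toy 3-D vectors (hypothetical data) in place of real 768-D embeddings.
text_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_embs = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

sim = cosine_similarity_matrix(text_embs, image_embs)
for row in sim:
    print([round(value, 3) for value in row])
```

Because both sides are normalized first, vector length has no effect on the score: only the angle between a query and a caption matters.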

Execution pipeline
  1. Tokenize prompts
  2. Encode text queries
  3. Encode image captions
  4. Cosine similarity (N×M)
  5. Top-K retrieval & ranking
  6. Diversity re-rank
  7. Insight generation
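
Steps 5 and 6 above can be sketched as follows. The page does not say which diversity method it uses, so this example assumes maximal marginal relevance (MMR), a common choice. The names `top_k` and `mmr_rerank`, the trade-off weight `lam`, and all scores are hypothetical and chosen only for illustration.

```python
def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def mmr_rerank(query_scores, item_sim, candidates, k, lam=0.7):
    """Greedy MMR (an assumed method): trade relevance against redundancy.

    query_scores[j] is the query-to-item similarity; item_sim[a][b] is the
    item-to-item similarity. lam=1.0 means pure relevance ranking.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr(j):
            # Penalize items that resemble something already picked.
            redundancy = max((item_sim[j][s] for s in selected), default=0.0)
            return lam * query_scores[j] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# One row of a similarity matrix (hypothetical scores for a single query).
query_scores = [0.90, 0.88, 0.60, 0.30]
# Items 0 and 1 are near-duplicates of each other.
item_sim = [
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.12, 0.05],
    [0.10, 0.12, 1.00, 0.20],
    [0.05, 0.05, 0.20, 1.00],
]

candidates = top_k(query_scores, 3)
diverse = mmr_rerank(query_scores, item_sim, candidates, k=2, lam=0.5)
print(candidates, diverse)
```

In this toy data the plain top-K keeps both near-duplicates, while the re-rank keeps the best match and swaps the duplicate for a less similar item.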
Query phrases
Top-K
Run retrieval to compute a real text↔image similarity matrix.