CLIP · ViT-L/14 · Foundation Model

Multimodal Search Engine

Real text embeddings × real image-caption embeddings: every cell of the matrix is a true cosine similarity between one query phrase and one image caption, computed in the shared 768-D embedding space.
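As a rough illustration of how such a matrix can be filled in, the sketch below L2-normalizes each embedding and takes dot products. It is a minimal sketch, not the page's actual implementation: the function names `l2_normalize` and `cosine_similarity_matrix` are hypothetical, and tiny 3-D vectors stand in for real 768-D CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(text_embs, image_embs):
    """Return a len(text_embs) x len(image_embs) matrix of cosine similarities."""
    text_unit = [l2_normalize(v) for v in text_embs]
    image_unit = [l2_normalize(v) for v in image_embs]
    # After normalization, the dot product of two vectors is their cosine similarity.
    return [
        [sum(a * b for a, b in zip(t, i)) for i in image_unit]
        for t in text_unit
    ]


# Toy stand-ins for embeddings (assumed values, not model output).
text_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_embs = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
sim = cosine_similarity_matrix(text_embs, image_embs)
```

Rows correspond to query phrases and columns to image captions, so `sim[r][c]` is the score of caption `c` for query `r`.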

GPU: 78%

Loss: 0.184

Throughput: 312/s

Execution pipeline

Query phrases

Top-K

Run retrieval to compute a real text↔image similarity matrix.
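Once a similarity matrix exists, Top-K retrieval for a query amounts to picking the K highest-scoring columns in that query's row. The following is a minimal sketch under that assumption; the helper name `top_k` and the example scores are hypothetical and not taken from the application.

```python
import heapq


def top_k(similarity_row, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    # enumerate() pairs each score with its column index (the image position).
    return heapq.nlargest(k, enumerate(similarity_row), key=lambda pair: pair[1])


# One row of a similarity matrix: scores of three images for a single query.
row = [0.1, 0.9, 0.5]
best = top_k(row, 2)
```

`heapq.nlargest` avoids sorting the full row, which may matter when the image collection is much larger than K.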