CLIP · ViT-L/14 · Foundation Model

Image → Text Understanding

Upload an image — the vision encoder produces a 768-D embedding, BLIP-2 generates a caption, and we compute full evaluation analytics.
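
As a rough illustration of the numbers involved, the sketch below works out how many tokens a ViT-L/14 encoder sees for one image and how two 768-D embeddings could be compared. The 224-pixel input size is an assumption taken from the commonly released CLIP ViT-L/14 checkpoint and is not stated on this page. `num_tokens`, `l2_normalize`, and `cosine_similarity` are illustrative helper names. The code is plain Python and does not call the actual model.

```python
import math

# Assumed settings (hypothetical for this page; 224 px matches the commonly
# released CLIP ViT-L/14 checkpoint, 768 is the embedding size quoted above).
IMAGE_SIZE = 224   # input resolution in pixels (assumption)
PATCH_SIZE = 14    # the "/14" in ViT-L/14
EMBED_DIM = 768    # size of the image embedding


def num_tokens(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Return the token count: one token per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + 1


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


if __name__ == "__main__":
    print(num_tokens())  # 16 * 16 patches + 1 class token = 257
```

Comparing L2-normalized vectors by dot product is the usual way CLIP-style embeddings are scored against each other, which is why the helper normalizes before multiplying.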

GPU 78%

Loss 0.184

Throughput 312/s

Execution pipeline

Drop an image or click to upload

JPG, PNG, WEBP · analyzed by Gemini multimodal
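
One way an upload handler might confirm that a file really is one of the three accepted formats is to inspect its leading bytes rather than trust the extension. This is a minimal sketch, not the page's actual validation logic. `sniff_image_format` is a hypothetical helper name, and the signatures checked are the standard JPEG, PNG, and WebP (RIFF container) magic numbers.

```python
from typing import Optional


def sniff_image_format(header: bytes) -> Optional[str]:
    """Return 'jpeg', 'png', or 'webp' based on magic bytes, else None.

    `header` should hold at least the first 12 bytes of the uploaded file.
    """
    # JPEG files begin with the SOI marker FF D8 followed by FF.
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    # PNG files begin with a fixed 8-byte signature.
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

A caller would read the first few bytes of the upload, pass them in, and reject the file when the result is `None`.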

Predicted caption

BLIP-2 generated description

Upload an image to run multimodal inference.