CLIP · ViT-L/14 · Foundation Model
Image → Text Understanding
Upload an image: the vision encoder produces a 768-D embedding, BLIP-2 generates a caption, and we compute full evaluation analytics.
GPU: 78%
Loss: 0.184
Throughput: 312/s
Execution pipeline
1. Uploading image
2. Image preprocessing
3. Feature extraction (ViT-L/14)
4. Embedding generation (768-D)
5. Cross-modal similarity
6. Caption generation (BLIP-2)
7. Metric evaluation
8. Report generation
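The cross-modal similarity step can be illustrated with a minimal sketch. It is not this app's actual implementation. It assumes the common CLIP-style recipe: L2-normalize the image and text embeddings, take their cosine similarity, then apply a scaled softmax over the candidate captions. The function names, the 4-D toy vectors (standing in for the 768-D embeddings), and the `logit_scale` of 100 are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity = dot product of the two unit-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def rank_captions(image_emb, text_embs, logit_scale=100.0):
    """Return (similarities, probabilities) for one image vs. N captions.

    logit_scale is an assumed temperature; adjust it to match the
    checkpoint actually in use.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    logits = [logit_scale * s for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sims, probs

# Toy 4-D embeddings (hypothetical values, stand-ins for 768-D vectors).
image_emb = [0.9, 0.1, 0.0, 0.4]
text_embs = [
    [1.0, 0.0, 0.0, 0.5],  # caption candidate 0
    [0.0, 1.0, 0.2, 0.0],  # caption candidate 1
]

sims, probs = rank_captions(image_emb, text_embs)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"best caption index: {best}, similarity: {sims[best]:.3f}")
```

Normalizing first means the score depends only on the angle between the embeddings, not their magnitudes, so image and text vectors from different encoders can be compared on one scale.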
Drop an image or click to upload
JPG, PNG, WEBP · analyzed by Gemini multimodal
Predicted caption
BLIP-2 generated description
Upload an image to run multimodal inference.