CLIP · ViT-L/14 · Foundation Model

Image → Text Understanding

Upload an image: the ViT-L/14 vision encoder produces a 768-D embedding, BLIP-2 generates a caption, and evaluation metrics are computed and compiled into a report.
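As a rough illustration of what "ViT-L/14" means for the encoder input: the 14 is the side length, in pixels, of each square patch the image is cut into. The sketch below counts the tokens the encoder would see, assuming a 224 × 224 input and one extra class token. The input resolution is not stated on this page, so treat it as an assumption; `vit_token_count` is a hypothetical helper name.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 14) -> int:
    """Tokens seen by a ViT: one per image patch plus one class token.

    image_size=224 is an assumed input resolution, not taken from this page.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size  # 224 // 14 = 16
    return patches_per_side * patches_per_side + 1  # 16 * 16 patches + class token


print(vit_token_count())  # 257
```

The 768-D embedding is a single vector for the whole image; it does not grow with the number of patches.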

Execution pipeline
  1. Uploading image
  2. Image preprocessing
  3. Feature extraction (ViT-L/14)
  4. Embedding generation (768-D)
  5. Cross-modal similarity
  6. Caption generation (BLIP-2)
  7. Metric evaluation
  8. Report generation
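Step 5 (cross-modal similarity) can be sketched as a cosine similarity between the image embedding and each candidate text embedding, followed by a softmax over the scaled scores. This is a minimal sketch using only the standard library, not this app's actual implementation: the random 768-D vectors stand in for real encoder outputs, and `rank_captions` and the `scale=100.0` factor are assumptions made for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(image_emb, text_embs, scale=100.0):
    """Return (similarities, probabilities) for each candidate text embedding.

    scale is an assumed temperature-like factor applied before the softmax.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    probs = softmax([scale * s for s in sims])
    return sims, probs


# Placeholder embeddings: random vectors, NOT real encoder outputs.
rng = random.Random(0)
image_emb = [rng.gauss(0, 1) for _ in range(768)]
text_embs = [
    [x + 0.1 * rng.gauss(0, 1) for x in image_emb],  # close to the image
    [rng.gauss(0, 1) for _ in range(768)],           # unrelated
]

sims, probs = rank_captions(image_emb, text_embs)
print([round(s, 3) for s in sims])
print([round(p, 3) for p in probs])
```

With real embeddings, the text candidate with the highest probability is the one the model considers the best match for the image.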
Drop an image or click to upload
JPG, PNG, WEBP · analyzed by Gemini multimodal
Predicted caption
BLIP-2 generated description
Upload an image to run multimodal inference.