CLIP · ViT-L/14 · Foundation Model

Zero-Shot Image Classification

Define text classes on the fly. The multimodal model scores each candidate class against the image, and a temperature-scaled softmax over the cosine similarities turns those scores into probabilities.

Execution pipeline
  1. Image preprocessing
  2. Vision encoder (ViT-L/14)
  3. Image embedding (768-D)
  4. Text encoders (per label)
  5. Cosine similarity scoring
  6. Temperature-scaled softmax
  7. Reasoning + evidence
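Steps 5 and 6 of the pipeline can be sketched in a few lines. This is a minimal illustration, not the demo's actual implementation: the function names (`cosine_similarity`, `softmax_with_temperature`, `classify`) are hypothetical, and the 3-D vectors are toy stand-ins for the 768-D embeddings an encoder would produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature, with the max subtracted for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image_embedding, text_embeddings, temperature=0.5):
    """Score one image embedding against a {label: embedding} mapping."""
    labels = list(text_embeddings)
    sims = [cosine_similarity(image_embedding, text_embeddings[l]) for l in labels]
    probs = softmax_with_temperature(sims, temperature)
    return dict(zip(labels, probs))

# Toy embeddings (assumed values, for illustration only).
image = [0.9, 0.1, 0.0]
labels = {
    "a photograph of a dog": [1.0, 0.0, 0.0],
    "a photograph of a cat": [0.0, 1.0, 0.0],
    "a tropical beach": [0.0, 0.0, 1.0],
}
probs = classify(image, labels)
best = max(probs, key=probs.get)
```

Because the probabilities are normalized only over the supplied labels, adding or removing a candidate class changes every score, even for an unchanged image.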
Input image: JPG, PNG, or WEBP · scored by Gemini multimodal
Candidate classes
  a photograph of a dog
  a photograph of a cat
  a city skyline at night
  a tropical beach
  a microchip closeup
Softmax T = 0.50
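To see what the temperature setting does, the sketch below applies the same softmax at several values of T. The similarity values are made up for illustration, and `softmax` is a hypothetical helper rather than the demo's own code. Dividing by a smaller T widens the gaps between scores before exponentiation, so the top class takes a larger share of the probability mass.

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives a sharper distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for three candidate classes.
similarities = [0.30, 0.25, 0.10]

for t in (1.0, 0.5, 0.1):
    probs = softmax(similarities, t)
    print(f"T={t:.2f}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Raw cosine similarities between image and text embeddings tend to sit close together, which is why a temperature below 1 is commonly used to make the ranking visible in the output probabilities.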
Prediction probabilities
Softmax over candidate classes
Upload an image and run inference.