CLIP · ViT-L/14 · Foundation Model

Zero-Shot Image Classification

Define text classes on the fly. The model embeds the image and each candidate label, scores every label by its cosine similarity to the image embedding, and converts those scores to probabilities with a temperature-scaled softmax.


GPU: 78%

Loss: 0.184

Throughput: 312/s

Execution pipeline

1. Image preprocessing
2. Vision encoder (ViT-L/14)
3. Image embedding (768-D)
4. Text encoders (per label)
5. Cosine similarity scoring
6. Temperature-scaled softmax
7. Reasoning + evidence
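
Steps 5 and 6 of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the demo's actual code: the function names (`cosine_similarity`, `temperature_softmax`, `classify`) are hypothetical, the 3-D vectors stand in for the real 768-D embeddings, and it assumes the temperature divides the similarities before the softmax.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def temperature_softmax(scores, temperature):
    """Softmax over scores / temperature (assumed convention)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_embedding, label_embeddings, temperature=0.5):
    """Map each label to a probability given one image embedding."""
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[name])
            for name in labels]
    return dict(zip(labels, temperature_softmax(sims, temperature)))


# Toy vectors standing in for encoder outputs (made up for illustration).
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a photograph of a dog": [1.0, 0.0, 0.1],
    "a photograph of a cat": [0.2, 1.0, 0.0],
    "a tropical beach": [0.0, 0.2, 1.0],
}

probs = classify(image_vec, label_vecs, temperature=0.5)
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.3f}")
```

In a real run the vectors would come from the vision and text encoders. Only the scoring logic is shown here.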

Drop an image or click to upload

JPG, PNG, WEBP · scored by Gemini multimodal

Candidate classes

a photograph of a dog
a photograph of a cat
a city skyline at night
a tropical beach
a microchip closeup

Softmax T = 0.50
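
To see what the temperature setting does, the sketch below applies the same softmax at several values of T to one fixed set of similarities. The similarity values are invented for illustration, the helper name `softmax_with_temperature` is hypothetical, and dividing by T is an assumption about how the slider is applied.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature (assumed convention)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities for five candidate classes.
similarities = [0.31, 0.24, 0.18, 0.15, 0.12]

top_probability = {}
for t in (1.0, 0.5, 0.05):
    probs = softmax_with_temperature(similarities, t)
    top_probability[t] = max(probs)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower T sharpens the distribution toward the best-matching class and higher T flattens it. Because cosine similarities are bounded in [-1, 1], moderate temperatures leave the probabilities fairly close together.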

Prediction probabilities

Softmax over candidate classes

Upload an image and run inference.