CLIP · ViT-L/14 · Foundation Model
Zero-Shot Image Classification
Define text classes on the fly. The multimodal model scores each candidate class against the image by cosine similarity, and a temperature-scaled softmax over those similarities turns the scores into probabilities.
GPU: 78%
Loss: 0.184
Throughput: 312/s
Execution pipeline
1. Image preprocessing
2. Vision encoder (ViT-L/14)
3. Image embedding (768-D)
4. Text encoders (per label)
5. Cosine similarity scoring
6. Temperature-scaled softmax
7. Reasoning + evidence
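The cosine similarity scoring step can be sketched as follows. This is a minimal illustration, not the demo's actual code: the function names `cosine_similarity` and `score_labels` are hypothetical, and toy 3-D vectors with made-up values stand in for the 768-D embeddings that the encoders would produce.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def score_labels(
    image_embedding: Sequence[float],
    text_embeddings: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score every candidate label against a single image embedding."""
    return {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }


# Assumption: toy vectors in place of real encoder outputs.
image_embedding = [1.0, 0.0, 0.0]
text_embeddings = {
    "a photograph of a dog": [2.0, 0.0, 0.0],  # same direction as the image
    "a photograph of a cat": [1.0, 1.0, 0.0],  # 45 degrees away
    "a tropical beach": [0.0, 3.0, 0.0],       # orthogonal
}
scores = score_labels(image_embedding, text_embeddings)
best_label = max(scores, key=scores.get)
```

Because cosine similarity ignores vector length, the label whose embedding points in the same direction as the image scores highest regardless of magnitude.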
Drop an image or click to upload
JPG, PNG, WEBP · scored by Gemini multimodal
Candidate classes
- a photograph of a dog
- a photograph of a cat
- a city skyline at night
- a tropical beach
- a microchip closeup
Softmax T: 0.50
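To show what the temperature setting does, here is a small sketch of a temperature-scaled softmax. It assumes the scores are divided by T before exponentiation, which is the usual convention but is not confirmed by this page. The function name `temperature_softmax` and the similarity values are hypothetical.

```python
import math
from typing import List, Sequence


def temperature_softmax(scores: Sequence[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not scores:
        raise ValueError("at least one score is required")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: made-up cosine similarities for three candidate classes.
similarities = [0.30, 0.20, 0.10]
probs_t050 = temperature_softmax(similarities, temperature=0.50)
probs_t100 = temperature_softmax(similarities, temperature=1.00)
```

Under this convention, a temperature below 1 widens the gaps between scores, so `probs_t050` puts more mass on the top class than `probs_t100` does, while the ranking of the classes stays the same.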
Prediction probabilities
Softmax over candidate classes
Upload an image and run inference.