Best Multimodal LLM APIs
Compare multimodal LLM APIs that accept text, images, video, and audio. Find the best vision-capable model for your use case and budget.
Full ranking — top 15 models
| # | Model | Provider | Input $/Mtok | Output $/Mtok | Blended | Context | |
|---|---|---|---|---|---|---|---|
| 1 | GLM-OCR | Zhipu | $0.030 | $0.030 | $0.030 | 128K | → |
| 2 | Pixtral 12B | Mistral | $0.100 | $0.100 | $0.100 | 128K | → |
| 3 | Reka Edge | Reka | $0.100 | $0.100 | $0.100 | 66K | → |
| 4 | Embed 4 | Cohere | $0.120 | $0.120 | $0.120 | — | → |
| 5 | voyage-multimodal-3.5 | Voyage AI | $0.120 | $0.120 | $0.120 | — | → |
| 6 | Nova Lite | Amazon | $0.060 | $0.240 | $0.150 | 300K | → |
| 7 | Gemini 2.5 Flash | Google | $0.075 | $0.300 | $0.188 | 1M | → |
| 8 | Llama 4 Scout | Meta | $0.110 | $0.340 | $0.225 | 10M | → |
| 9 | Gemini 2.5 Flash-Lite | Google | $0.100 | $0.400 | $0.250 | 1M | → |
| 10 | GPT-4.1 nano | OpenAI | $0.100 | $0.400 | $0.250 | 1M | → |
| 11 | Grok 4.1 Fast | xAI | $0.200 | $0.500 | $0.350 | 2M | → |
| 12 | Llama 4 Scout | Groq | $0.110 | $1.00 | $0.555 | 10M | → |
| 13 | GLM-4.6V | Zhipu | $0.300 | $0.900 | $0.600 | 128K | → |
| 14 | MiniMax M3 | Fireworks | $0.300 | $1.20 | $0.750 | 1M | → |
| 15 | MiniMax-M3 | MiniMax | $0.300 | $1.20 | $0.750 | 1M | → |
How models are selected
Models that accept image input, sorted by blended cost from lowest to highest.
Prices are per million tokens (Mtok), taken from official provider pricing pages and checked by our automated scraper pipeline, which runs three times daily. "Blended cost" is the simple average of the input and output prices: a quick proxy for workloads that use roughly equal numbers of input and output tokens.
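The blended figure can be reproduced, and compared against a workload that is not split 50/50, with a few lines of Python. This is a minimal sketch, not the site's actual calculator: the function names `blended_cost` and `request_cost` are hypothetical, the token counts in the usage note are made up for illustration, and the prices are copied from three rows of the table above.

```python
def blended_cost(input_per_mtok: float, output_per_mtok: float) -> float:
    """Simple average of input and output price, per million tokens."""
    return (input_per_mtok + output_per_mtok) / 2


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float, output_per_mtok: float) -> float:
    """Cost of one workload, given its actual input and output token counts."""
    return (input_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000


# (model, input $/Mtok, output $/Mtok), copied from the ranking table.
models = [
    ("Llama 4 Scout (Groq)", 0.110, 1.00),
    ("Nova Lite (Amazon)", 0.060, 0.240),
    ("GLM-OCR (Zhipu)", 0.030, 0.030),
]

# Sort the same way the table does: cheapest blended cost first.
ranked = sorted(models, key=lambda m: blended_cost(m[1], m[2]))

for name, inp, out in ranked:
    print(f"{name}: blended ${blended_cost(inp, out):.3f}/Mtok")
```

Vision workloads are often input-heavy, since images tend to consume many input tokens while the answer is short. In that case `request_cost(900_000, 100_000, 0.110, 1.00)` gives a more realistic estimate than the blended figure, because the higher output price applies to only a tenth of the tokens.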