LLM Performance Rankings
Benchmark scores and performance-per-dollar across 61 tested models. Updated with the latest public evaluations.
Best Value — Performance per Dollar
Average benchmark score divided by blended cost ($/Mtok). Higher values mean more performance for your money.
| # | Model | Provider | Avg Score | Blended $/M | Perf/$ |
|---|---|---|---|---|---|
| 1 | GPT OSS 20B | Fireworks | 98.7% | $0.18 | 533.5 |
| 2 | Llama 3.1 8B | Meta | 34% | $0.07 | 523.1 |
| 3 | Nova Micro | Amazon | 30.9% | $0.09 | 353.1 |
| 4 | Gemini 2.5 Flash | Google | 53.7% | $0.19 | 286.4 |
| 5 | Nova Lite | Amazon | 36.9% | $0.15 | 246 |
| 6 | DeepSeek V4 Flash | DeepSeek | 51.4% | $0.21 | 244.8 |
| 7 | Llama 4 Scout | Meta | 47.6% | $0.23 | 211.6 |
| 8 | Gemini 2.5 Flash-Lite | Google | 42.4% | $0.25 | 169.6 |
| 9 | Grok 4.1 Fast | xAI | 57% | $0.35 | 162.9 |
| 10 | GPT-4.1 nano | OpenAI | 39.6% | $0.25 | 158.4 |
| 11 | Codestral 2508 | Mistral | 65% | $0.6 | 108.3 |
| 12 | Codestral | Mistral | 61.5% | $0.6 | 102.5 |
| 13 | DeepSeek V4 Pro | DeepSeek | 64.1% | $0.65 | 98.2 |
| 14 | Granite 4 H Medium | IBM | 36.4% | $0.38 | 97.1 |
| 15 | MiniMax-M3 | MiniMax | 55% | $0.75 | 73.3 |
| 16 | Llama 3.3 70B | Meta | 43.6% | $0.69 | 63.2 |
| 17 | Gemini 3.1 Flash-Lite | Google | 48.7% | $0.88 | 55.7 |
| 18 | Granite 4 H Large | IBM | 40.8% | $0.75 | 54.4 |
| 19 | Mistral Large 3 | Mistral | 53.6% | $1 | 53.6 |
| 20 | GPT-4.1 mini | OpenAI | 50.6% | $1 | 50.6 |
| 21 | Gemini 3.1 Flash | Google | 60.3% | $1.4 | 43.1 |
| 22 | Grok 4.3 | xAI | 72.4% | $1.88 | 38.6 |
| 23 | QwQ-Plus | Alibaba | 57.3% | $1.6 | 35.8 |
| 24 | Claude Haiku 3.5 | Anthropic | 78.8% | $2.4 | 32.8 |
| 25 | Command A | Cohere | 46.8% | $1.5 | 31.2 |
Best Reasoning — GPQA Diamond
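PhD-level science reasoning in biology, chemistry, and physics; one of the hardest general reasoning benchmarks. The Perf/$ column in this and the following benchmark tables repeats the overall average-score figure from the Best Value table; it is not the benchmark score divided by cost.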
| # | Model | Provider | GPQA Score | Cost $/M | Perf/$ |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 94.2% | $15 | 6.1 |
| 2 | Claude Mythos 5 | Anthropic | 94.1% | $15 | 5.6 |
| 3 | Claude Fable 5 | Anthropic | 94.1% | $30 | 3.1 |
| 4 | Claude Opus 4.8 | Anthropic | 93.6% | $15 | 5.3 |
| 5 | GPT-5.4 | OpenAI | 88% | $8.75 | 9 |
| 6 | Gemini 3.1 Pro | Google | 88% | $7 | 11.3 |
| 7 | o4-mini | OpenAI | 82% | $2.75 | 26.8 |
| 8 | Grok 4.3 | xAI | 80% | $1.88 | 38.6 |
| 9 | o3-mini | OpenAI | 78% | $2.75 | 25.4 |
| 10 | Grok 4.20 | xAI | 78% | $4 | 17.2 |
| 11 | GPT-5.4 mini | OpenAI | 75% | $2.62 | 25.4 |
| 12 | Grok 4 | xAI | 75% | $9 | 7.4 |
| 13 | Gemini 3.5 Flash | Google | 72% | $5.25 | 12.6 |
| 14 | DeepSeek V4 Pro | DeepSeek | 72% | $0.65 | 98.2 |
| 15 | GPT-4.1 | OpenAI | 70% | $5 | 12.3 |
Best Coding — SWE-Bench
Real-world software engineering tasks — resolving GitHub issues across popular repos.
| # | Model | Provider | SWE-Bench | Cost $/M | Perf/$ |
|---|---|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | 95.5% | $15 | 5.6 |
| 2 | Claude Fable 5 | Anthropic | 95% | $30 | 3.1 |
| 3 | Claude Opus 4.8 | Anthropic | 88.6% | $15 | 5.3 |
| 4 | Claude Opus 4.7 | Anthropic | 87.6% | $15 | 6.1 |
| 5 | Claude Sonnet 4.5 | Anthropic | 82% | $9 | 9.1 |
| 6 | GPT-5.4 | OpenAI | 75% | $8.75 | 9 |
| 7 | Gemini 3.1 Pro | Google | 75% | $7 | 11.3 |
| 8 | o4-mini | OpenAI | 68% | $2.75 | 26.8 |
| 9 | Grok 4.3 | xAI | 65% | $1.88 | 38.6 |
| 10 | o3-mini | OpenAI | 60% | $2.75 | 25.4 |
| 11 | Grok 4.20 | xAI | 58% | $4 | 17.2 |
| 12 | GPT-5.4 mini | OpenAI | 55% | $2.62 | 25.4 |
| 13 | Gemini 3.5 Flash | Google | 55% | $5.25 | 12.6 |
| 14 | Grok 4 | xAI | 55% | $9 | 7.4 |
| 15 | Gemini 2.5 Pro | Google | 50% | $5.62 | 11.2 |
Best Math — AIME 2025
American Invitational Mathematics Examination problems — competition-level math.
| # | Model | Provider | AIME Score | Cost $/M | Perf/$ |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 100% | $7 | 11.3 |
| 2 | Claude Opus 4.6 | Anthropic | 99.8% | $15 | 5.8 |
| 3 | GPT OSS 20B | Fireworks | 98.7% | $0.18 | 533.5 |
| 4 | GPT-5.4 | OpenAI | 95.5% | $8.75 | 9 |
| 5 | o4-mini | OpenAI | 88% | $2.75 | 26.8 |
| 6 | o3-mini | OpenAI | 85% | $2.75 | 25.4 |
| 7 | Grok 4.3 | xAI | 85% | $1.88 | 38.6 |
| 8 | GPT-5.4 mini | OpenAI | 82% | $2.62 | 25.4 |
| 9 | Grok 4.20 | xAI | 80% | $4 | 17.2 |
| 10 | Gemini 3.5 Flash | Google | 78% | $5.25 | 12.6 |
| 11 | Grok 4 | xAI | 78% | $9 | 7.4 |
| 12 | Gemini 2.5 Pro | Google | 75% | $5.62 | 11.2 |
| 13 | DeepSeek V4 Pro | DeepSeek | 75% | $0.65 | 98.2 |
| 14 | Qwen3-Max | Alibaba | 72% | $3 | 20.6 |
| 15 | Gemini 3.1 Flash | Google | 70% | $1.4 | 43.1 |
Fastest Models — Inference Speed
Tokens per second on hosted inference. Speed matters for real-time applications and high-throughput pipelines.
| # | Model | Provider | Tokens/sec | Cost $/M |
|---|---|---|---|---|
| 1 | Llama 4 Scout | Meta | 2600 t/s | $0.23 |
| 2 | Llama 4 Scout | Groq | 2600 t/s | $0.56 |
| 3 | Llama 3.3 70B | Meta | 2500 t/s | $0.69 |
| 4 | Llama 3.3 70B Versatile | Groq | 2500 t/s | $0.79 |
| 5 | Llama 3.1 8B | Meta | 1800 t/s | $0.07 |
| 6 | Llama 3.1 8B Instant | Groq | 1800 t/s | $0.53 |
| 7 | Gemini 3.1 Flash-Lite | Google | 600 t/s | $0.88 |
| 8 | GPT OSS 20B | Fireworks | 564 t/s | $0.18 |
| 9 | Gemini 3.1 Flash | Google | 500 t/s | $1.4 |
| 10 | Gemini 2.5 Flash-Lite | Google | 450 t/s | $0.25 |
| 11 | Gemini 3.5 Flash | Google | 400 t/s | $5.25 |
| 12 | Gemini 2.5 Flash | Google | 350 t/s | $0.19 |
| 13 | Grok 4.1 Fast | xAI | 300 t/s | $0.35 |
| 14 | Nova Micro | Amazon | 300 t/s | $0.09 |
| 15 | GPT OSS 120B | Fireworks | 260 t/s | $0.38 |
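As a rough illustration of what these throughput figures mean in practice, the sketch below estimates how long it takes to generate a response of a given length. It assumes steady throughput and ignores time to first token, network latency, and queueing, none of which the table reports. The helper name `generation_seconds` and the 1,000-token response length are illustrative choices, not part of the source data.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate decode time for a response, assuming constant throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Throughput values copied from the table above (tokens per second).
throughput = {
    "Llama 4 Scout": 2600,
    "GPT OSS 20B": 564,
    "Grok 4.1 Fast": 300,
}

# Hypothetical response length used only for this comparison.
RESPONSE_TOKENS = 1000

for model, tps in throughput.items():
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{model}: {seconds:.2f} s per {RESPONSE_TOKENS} output tokens")
```

Real end-to-end latency will usually be higher than this estimate, since it covers only the token generation phase.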
Hardest Benchmark — Humanity's Last Exam
Expert-level questions across 100+ fields. No model scores above 65% — the frontier of AI capability.
| # | Model | Provider | HLE Score | Cost $/M |
|---|---|---|---|---|
| 1 | Claude Mythos 5 | Anthropic | 64.5% | $15 |
| 2 | Claude Opus 4.8 | Anthropic | 57.9% | $15 |
| 3 | Gemini 3.1 Pro | Google | 45.8% | $7 |
| 4 | GPT-5.5 | OpenAI | 43.1% | $17.5 |
| 5 | GPT-5.4 | OpenAI | 40.2% | $8.75 |
| 6 | Grok 4.3 | xAI | 38% | $1.88 |
| 7 | o4-mini | OpenAI | 35% | $2.75 |
| 8 | Grok 4.20 | xAI | 32% | $4 |
| 9 | o3-mini | OpenAI | 30% | $2.75 |
| 10 | Grok 4 | xAI | 30% | $9 |
Methodology
Benchmark sources: Vellum LLM Leaderboard, Artificial Analysis, official model technical reports, and the Hugging Face Open LLM Leaderboard. Scores are the latest publicly reported results as of June 2026.
Performance per dollar is calculated as avg_benchmark_score / blended_cost_per_mtok, where blended cost is the average of input and output pricing per million tokens. This metric rewards models that deliver strong performance at low cost — the core value proposition for production AI workloads.
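The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the site's actual implementation: the `ModelPricing` class, the model names, and the input and output prices are all hypothetical, since the tables list only the blended cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    name: str
    avg_score: float        # average benchmark score in percent (0-100)
    input_per_mtok: float   # USD per million input tokens (hypothetical)
    output_per_mtok: float  # USD per million output tokens (hypothetical)


def blended_cost(model: ModelPricing) -> float:
    """Average of input and output pricing per million tokens."""
    return (model.input_per_mtok + model.output_per_mtok) / 2


def perf_per_dollar(model: ModelPricing) -> float:
    """avg_benchmark_score / blended_cost_per_mtok."""
    cost = blended_cost(model)
    if cost <= 0:
        raise ValueError("blended cost must be positive")
    return model.avg_score / cost


# Hypothetical models; the prices are made up so that the blended
# costs come out to $1.00 and $0.60 per million tokens.
models = [
    ModelPricing("model-a", avg_score=53.6, input_per_mtok=0.50, output_per_mtok=1.50),
    ModelPricing("model-b", avg_score=65.0, input_per_mtok=0.30, output_per_mtok=0.90),
]

# Rank by value, best first.
ranked = sorted(models, key=perf_per_dollar, reverse=True)
for m in ranked:
    print(f"{m.name}: blended ${blended_cost(m):.2f}/Mtok, perf/$ {perf_per_dollar(m):.1f}")
```

Recomputing from the rounded values shown in the tables may not reproduce the listed Perf/$ exactly (for example, 98.7 / 0.18 ≈ 548.3 versus the listed 533.5), presumably because the displayed costs are rounded to two decimals.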
Benchmark descriptions:
- GPQA Diamond — PhD-level questions in biology, chemistry, and physics
- AIME 2025 — American Invitational Mathematics Examination (competition math)
- SWE-Bench — Resolving real GitHub issues in popular Python repositories
- Humanity's Last Exam — Expert-level questions across 100+ academic fields
- ARC-AGI 2 — Visual reasoning and abstract pattern completion
- MMMLU — Multilingual version of MMLU (57 subjects, multiple languages)
- HumanEval — Python code generation from natural language descriptions
- MATH 500 — Mathematical problem solving across 5 difficulty levels
- BFCL — Berkeley Function Calling Leaderboard (tool use & function calling)
This data is also available programmatically through the benchmarks API endpoint; see the API documentation for details.