Cheapest: GLM-4.7-Flash at $0.000/Mtok, among 153 models tracked. Updated Jun 25, 2026.
ModelPriceWatch$/Mtok

LLM Performance Rankings

Benchmark scores and performance-per-dollar across 61 tested models. Updated with the latest public evaluations.

61 models benchmarked
9 benchmarks tracked
24 providers covered

Best Value — Performance per Dollar

Average benchmark score divided by blended cost ($/Mtok). Higher = more performance for your money.

# Model Provider Avg Score Blended $/M Perf/$
1 GPT OSS 20B Fireworks 98.7% $0.18 533.5
2 Llama 3.1 8B Meta 34% $0.07 523.1
3 Nova Micro Amazon 30.9% $0.09 353.1
4 Gemini 2.5 Flash Google 53.7% $0.19 286.4
5 Nova Lite Amazon 36.9% $0.15 246
6 DeepSeek V4 Flash DeepSeek 51.4% $0.21 244.8
7 Llama 4 Scout Meta 47.6% $0.23 211.6
8 Gemini 2.5 Flash-Lite Google 42.4% $0.25 169.6
9 Grok 4.1 Fast xAI 57% $0.35 162.9
10 GPT-4.1 nano OpenAI 39.6% $0.25 158.4
11 Codestral 2508 Mistral 65% $0.6 108.3
12 Codestral Mistral 61.5% $0.6 102.5
13 DeepSeek V4 Pro DeepSeek 64.1% $0.65 98.2
14 Granite 4 H Medium IBM 36.4% $0.38 97.1
15 MiniMax-M3 MiniMax 55% $0.75 73.3
16 Llama 3.3 70B Meta 43.6% $0.69 63.2
17 Gemini 3.1 Flash-Lite Google 48.7% $0.88 55.7
18 Granite 4 H Large IBM 40.8% $0.75 54.4
19 Mistral Large 3 Mistral 53.6% $1 53.6
20 GPT-4.1 mini OpenAI 50.6% $1 50.6
21 Gemini 3.1 Flash Google 60.3% $1.4 43.1
22 Grok 4.3 xAI 72.4% $1.88 38.6
23 QwQ-Plus Alibaba 57.3% $1.6 35.8
24 Claude Haiku 3.5 Anthropic 78.8% $2.4 32.8
25 Command A Cohere 46.8% $1.5 31.2

Best Reasoning — GPQA Diamond

PhD-level science reasoning. One of the hardest general reasoning benchmarks.

# Model Provider GPQA Score Cost $/M Perf/$
1 Claude Opus 4.7 Anthropic 94.2% $15 6.1
2 Claude Mythos 5 Anthropic 94.1% $15 5.6
3 Claude Fable 5 Anthropic 94.1% $30 3.1
4 Claude Opus 4.8 Anthropic 93.6% $15 5.3
5 GPT-5.4 OpenAI 88% $8.75 9
6 Gemini 3.1 Pro Google 88% $7 11.3
7 o4-mini OpenAI 82% $2.75 26.8
8 Grok 4.3 xAI 80% $1.88 38.6
9 o3-mini OpenAI 78% $2.75 25.4
10 Grok 4.20 xAI 78% $4 17.2
11 GPT-5.4 mini OpenAI 75% $2.62 25.4
12 Grok 4 xAI 75% $9 7.4
13 Gemini 3.5 Flash Google 72% $5.25 12.6
14 DeepSeek V4 Pro DeepSeek 72% $0.65 98.2
15 GPT-4.1 OpenAI 70% $5 12.3

Best Coding — SWE-Bench

Real-world software engineering tasks — resolving GitHub issues across popular repos.

# Model Provider SWE-Bench Cost $/M Perf/$
1 Claude Mythos 5 Anthropic 95.5% $15 5.6
2 Claude Fable 5 Anthropic 95% $30 3.1
3 Claude Opus 4.8 Anthropic 88.6% $15 5.3
4 Claude Opus 4.7 Anthropic 87.6% $15 6.1
5 Claude Sonnet 4.5 Anthropic 82% $9 9.1
6 GPT-5.4 OpenAI 75% $8.75 9
7 Gemini 3.1 Pro Google 75% $7 11.3
8 o4-mini OpenAI 68% $2.75 26.8
9 Grok 4.3 xAI 65% $1.88 38.6
10 o3-mini OpenAI 60% $2.75 25.4
11 Grok 4.20 xAI 58% $4 17.2
12 GPT-5.4 mini OpenAI 55% $2.62 25.4
13 Gemini 3.5 Flash Google 55% $5.25 12.6
14 Grok 4 xAI 55% $9 7.4
15 Gemini 2.5 Pro Google 50% $5.62 11.2

Best Math — AIME 2025

American Invitational Mathematics Examination problems — competition-level math.

# Model Provider AIME Score Cost $/M Perf/$
1 Gemini 3.1 Pro Google 100% $7 11.3
2 Claude Opus 4.6 Anthropic 99.8% $15 5.8
3 GPT OSS 20B Fireworks 98.7% $0.18 533.5
4 GPT-5.4 OpenAI 95.5% $8.75 9
5 o4-mini OpenAI 88% $2.75 26.8
6 o3-mini OpenAI 85% $2.75 25.4
7 Grok 4.3 xAI 85% $1.88 38.6
8 GPT-5.4 mini OpenAI 82% $2.62 25.4
9 Grok 4.20 xAI 80% $4 17.2
10 Gemini 3.5 Flash Google 78% $5.25 12.6
11 Grok 4 xAI 78% $9 7.4
12 Gemini 2.5 Pro Google 75% $5.62 11.2
13 DeepSeek V4 Pro DeepSeek 75% $0.65 98.2
14 Qwen3-Max Alibaba 72% $3 20.6
15 Gemini 3.1 Flash Google 70% $1.4 43.1

Fastest Models — Inference Speed

Tokens per second on hosted inference. Speed matters for real-time applications and high-throughput pipelines.

# Model Provider Tokens/sec Cost $/M
1 Llama 4 Scout Meta 2600 t/s $0.23
2 Llama 4 Scout Groq 2600 t/s $0.56
3 Llama 3.3 70B Meta 2500 t/s $0.69
4 Llama 3.3 70B Versatile Groq 2500 t/s $0.79
5 Llama 3.1 8B Meta 1800 t/s $0.07
6 Llama 3.1 8B Instant Groq 1800 t/s $0.53
7 Gemini 3.1 Flash-Lite Google 600 t/s $0.88
8 GPT OSS 20B Fireworks 564 t/s $0.18
9 Gemini 3.1 Flash Google 500 t/s $1.4
10 Gemini 2.5 Flash-Lite Google 450 t/s $0.25
11 Gemini 3.5 Flash Google 400 t/s $5.25
12 Gemini 2.5 Flash Google 350 t/s $0.19
13 Grok 4.1 Fast xAI 300 t/s $0.35
14 Nova Micro Amazon 300 t/s $0.09
15 GPT OSS 120B Fireworks 260 t/s $0.38
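
To make the speed and cost columns above concrete, here is a rough Python sketch that estimates how long a response would take to generate and roughly what it would cost. The function names and the 1,000-token response size are assumptions chosen for illustration, not part of this site. The estimate ignores time-to-first-token, network latency, and the difference between input and output pricing, so treat the numbers as ballpark figures only.

```python
def generation_time_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream output_tokens at a steady tokens_per_second rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def approx_cost_dollars(total_tokens: int, blended_per_mtok: float) -> float:
    """Approximate cost using the blended $/Mtok figure (not exact billing)."""
    return total_tokens / 1_000_000 * blended_per_mtok


# (tokens/sec, blended $/M) copied from the speed table above.
models = {
    "Llama 4 Scout (Groq)": (2600, 0.56),
    "GPT OSS 20B (Fireworks)": (564, 0.18),
    "Nova Micro (Amazon)": (300, 0.09),
}

RESPONSE_TOKENS = 1_000  # hypothetical response length

for name, (tps, cost_per_mtok) in models.items():
    seconds = generation_time_seconds(RESPONSE_TOKENS, tps)
    dollars = approx_cost_dollars(RESPONSE_TOKENS, cost_per_mtok)
    print(f"{name}: ~{seconds:.2f} s, ~${dollars:.6f} per response")
```

A faster model is not necessarily the cheaper one per response, which is why the table lists both columns side by side.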

Hardest Benchmark — Humanity's Last Exam

Expert-level questions across 100+ fields. No model scores above 65% — the frontier of AI capability.

#ModelProviderHLE ScoreCost $/M
1 Claude Mythos 5 Anthropic 64.5% $15
2 Claude Opus 4.8 Anthropic 57.9% $15
3 Gemini 3.1 Pro Google 45.8% $7
4 GPT-5.5 OpenAI 43.1% $17.5
5 GPT-5.4 OpenAI 40.2% $8.75
6 Grok 4.3 xAI 38% $1.88
7 o4-mini OpenAI 35% $2.75
8 Grok 4.20 xAI 32% $4
9 o3-mini OpenAI 30% $2.75
10 Grok 4 xAI 30% $9

Methodology

Benchmark sources: Vellum LLM Leaderboard, Artificial Analysis, official model technical reports, and the Hugging Face Open LLM Leaderboard. Scores are the latest publicly reported results as of June 2026.

Performance per dollar is calculated as avg_benchmark_score / blended_cost_per_mtok, where blended cost is the average of input and output pricing per million tokens. This metric rewards models that deliver strong performance at low cost — the core value proposition for production AI workloads.
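
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function names are assumptions, and the input/output prices in the last line are hypothetical. The scores and blended costs in `rows` are copied from the Best Value table. Because the table displays blended cost rounded to two decimals, values recomputed from those rounded figures can differ slightly from the published Perf/$ column.

```python
def blended_cost(input_price: float, output_price: float) -> float:
    """Average of input and output price, both in $ per million tokens."""
    return (input_price + output_price) / 2


def perf_per_dollar(avg_score: float, blended: float) -> float:
    """Average benchmark score (in percent) divided by blended $/Mtok."""
    if blended <= 0:
        # A free model (blended cost of $0) has no finite perf-per-dollar value.
        raise ValueError("blended cost must be positive")
    return avg_score / blended


# (model, average score in %, blended $/Mtok) from the Best Value table.
rows = [
    ("Grok 4.3", 72.4, 1.88),
    ("Mistral Large 3", 53.6, 1.00),
    ("GPT-4.1 mini", 50.6, 1.00),
]

# Rank from best to worst value.
ranked = sorted(rows, key=lambda r: perf_per_dollar(r[1], r[2]), reverse=True)
for name, score, cost in ranked:
    print(f"{name}: {perf_per_dollar(score, cost):.1f}")

# Hypothetical prices (not from the source): $0.50 input, $1.50 output.
print(blended_cost(0.50, 1.50))
```

Note that a simple average weights input and output tokens equally; workloads that are heavily input- or output-dominated may see a different effective cost.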

Benchmark descriptions:

  • GPQA Diamond — PhD-level questions in biology, chemistry, and physics
  • AIME 2025 — American Invitational Mathematics Examination (competition math)
  • SWE-Bench — Resolving real GitHub issues in popular Python repositories
  • Humanity's Last Exam — Expert-level questions across 100+ academic fields
  • ARC-AGI 2 — Visual reasoning and abstract pattern completion
  • MMMLU — Multilingual version of MMLU (57 subjects, multiple languages)
  • HumanEval — Python code generation from natural language descriptions
  • MATH 500 — Mathematical problem solving across 5 difficulty levels
  • BFCL — Berkeley Function Calling Leaderboard (tool use & function calling)

Access this data programmatically via the benchmarks API endpoint; see the API documentation for details.