What benchmarks are used for LLM performance rankings?

We track GPQA Diamond (PhD-level science reasoning), AIME 2025 (competition math), SWE-Bench (software engineering), Humanity's Last Exam (expert-level questions across 100+ fields), ARC-AGI 2 (visual reasoning), MMMLU (multilingual knowledge), HumanEval (code generation), MATH 500 (math problem solving), and BFCL (function calling).

How is performance per dollar calculated?

Performance per dollar is the average benchmark score divided by the blended cost per million tokens (average of input and output pricing). Higher scores at lower cost yield better value rankings.

Where does the benchmark data come from?

Benchmark scores are compiled from the Vellum LLM Leaderboard, Artificial Analysis, official model technical reports, and the HuggingFace Open LLM Leaderboard. Each model's source is listed on its detail page.

LLM Performance Rankings — Benchmark Scores & Value | ModelPriceWatch

Name: LLM Benchmark Performance Rankings
Creator: ModelPriceWatch
License: https://modelpricewatch.com/api/

Best Value — Performance per Dollar

Average benchmark score divided by blended cost ($/Mtok). Higher = more performance for your money.

#	Model	Provider	Avg Score	Blended $/M	Perf/$
1	GPT OSS 20B	Fireworks	98.7%	$0.18	533.5
2	Llama 3.1 8B	Meta	34%	$0.07	523.1
3	Nova Micro	Amazon	30.9%	$0.09	353.1
4	Gemini 2.5 Flash	Google	53.7%	$0.19	286.4
5	Nova Lite	Amazon	36.9%	$0.15	246
6	DeepSeek V4 Flash	DeepSeek	51.4%	$0.21	244.8
7	Llama 4 Scout	Meta	47.6%	$0.23	211.6
8	Gemini 2.5 Flash-Lite	Google	42.4%	$0.25	169.6
9	Grok 4.1 Fast	xAI	57%	$0.35	162.9
10	GPT-4.1 nano	OpenAI	39.6%	$0.25	158.4
11	Codestral 2508	Mistral	65%	$0.6	108.3
12	Codestral	Mistral	61.5%	$0.6	102.5
13	DeepSeek V4 Pro	DeepSeek	64.1%	$0.65	98.2
14	Granite 4 H Medium	IBM	36.4%	$0.38	97.1
15	MiniMax-M3	MiniMax	55%	$0.75	73.3
16	Llama 3.3 70B	Meta	43.6%	$0.69	63.2
17	Gemini 3.1 Flash-Lite	Google	48.7%	$0.88	55.7
18	Granite 4 H Large	IBM	40.8%	$0.75	54.4
19	Mistral Large 3	Mistral	53.6%	$1	53.6
20	GPT-4.1 mini	OpenAI	50.6%	$1	50.6
21	Gemini 3.1 Flash	Google	60.3%	$1.4	43.1
22	Grok 4.3	xAI	72.4%	$1.88	38.6
23	QwQ-Plus	Alibaba	57.3%	$1.6	35.8
24	Claude Haiku 3.5	Anthropic	78.8%	$2.4	32.8
25	Command A	Cohere	46.8%	$1.5	31.2
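The Perf/$ column can be approximately recomputed from the displayed columns. The sketch below is a minimal illustration, assuming rows are tab-separated as in the table above with the rank column dropped; the helper names `parse_row` and `rank_by_value` are hypothetical and not part of any ModelPriceWatch API. Recomputed values can differ from the listed ones in the last digit (Grok 4.3 lists 38.6, while the displayed columns give 38.5), which suggests the listed figures come from unrounded costs.

```python
# Three rows copied from the table above (rank column omitted).
ROWS = """\
Grok 4.3\txAI\t72.4%\t$1.88\t38.6
Claude Haiku 3.5\tAnthropic\t78.8%\t$2.4\t32.8
Mistral Large 3\tMistral\t53.6%\t$1\t53.6
"""


def parse_row(line):
    """Split one tab-separated row into typed fields."""
    model, provider, score, cost, listed = line.split("\t")
    return {
        "model": model,
        "provider": provider,
        "avg_score": float(score.rstrip("%")),    # percent, e.g. 72.4
        "blended_cost": float(cost.lstrip("$")),  # USD per million tokens
        "listed_perf": float(listed),             # Perf/$ as shown on the page
    }


def rank_by_value(rows):
    """Recompute Perf/$ from the displayed columns and sort best value first."""
    for row in rows:
        row["recomputed_perf"] = round(row["avg_score"] / row["blended_cost"], 1)
    return sorted(rows, key=lambda r: r["recomputed_perf"], reverse=True)


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
ranked = rank_by_value(rows)
for row in ranked:
    print(row["model"], row["recomputed_perf"], row["listed_perf"])
```

Sorting on the recomputed value rather than the listed one keeps the ranking reproducible from the visible data alone, at the cost of small rounding differences.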

Best Reasoning — GPQA Diamond

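PhD-level science reasoning in biology, chemistry, and physics, and one of the hardest general reasoning benchmarks tracked here. The Perf/$ column repeats each model's overall performance per dollar from the Best Value table (average benchmark score divided by blended cost); it is not the GPQA score divided by cost.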

#	Model	Provider	GPQA Score	Blended $/M	Overall Perf/$
1	Claude Opus 4.7	Anthropic	94.2%	$15	6.1
2	Claude Mythos 5	Anthropic	94.1%	$15	5.6
3	Claude Fable 5	Anthropic	94.1%	$30	3.1
4	Claude Opus 4.8	Anthropic	93.6%	$15	5.3
5	GPT-5.4	OpenAI	88%	$8.75	9
6	Gemini 3.1 Pro	Google	88%	$7	11.3
7	o4-mini	OpenAI	82%	$2.75	26.8
8	Grok 4.3	xAI	80%	$1.88	38.6
9	o3-mini	OpenAI	78%	$2.75	25.4
10	Grok 4.20	xAI	78%	$4	17.2
11	GPT-5.4 mini	OpenAI	75%	$2.62	25.4
12	Grok 4	xAI	75%	$9	7.4
13	Gemini 3.5 Flash	Google	72%	$5.25	12.6
14	DeepSeek V4 Pro	DeepSeek	72%	$0.65	98.2
15	GPT-4.1	OpenAI	70%	$5	12.3

Best Coding — SWE-Bench

Real-world software engineering tasks — resolving GitHub issues across popular repos.

#	Model	Provider	SWE-Bench	Blended $/M	Overall Perf/$
1	Claude Mythos 5	Anthropic	95.5%	$15	5.6
2	Claude Fable 5	Anthropic	95%	$30	3.1
3	Claude Opus 4.8	Anthropic	88.6%	$15	5.3
4	Claude Opus 4.7	Anthropic	87.6%	$15	6.1
5	Claude Sonnet 4.5	Anthropic	82%	$9	9.1
6	GPT-5.4	OpenAI	75%	$8.75	9
7	Gemini 3.1 Pro	Google	75%	$7	11.3
8	o4-mini	OpenAI	68%	$2.75	26.8
9	Grok 4.3	xAI	65%	$1.88	38.6
10	o3-mini	OpenAI	60%	$2.75	25.4
11	Grok 4.20	xAI	58%	$4	17.2
12	GPT-5.4 mini	OpenAI	55%	$2.62	25.4
13	Gemini 3.5 Flash	Google	55%	$5.25	12.6
14	Grok 4	xAI	55%	$9	7.4
15	Gemini 2.5 Pro	Google	50%	$5.62	11.2

Best Math — AIME 2025

American Invitational Mathematics Examination problems — competition-level math.

#	Model	Provider	AIME Score	Blended $/M	Overall Perf/$
1	Gemini 3.1 Pro	Google	100%	$7	11.3
2	Claude Opus 4.6	Anthropic	99.8%	$15	5.8
3	GPT OSS 20B	Fireworks	98.7%	$0.18	533.5
4	GPT-5.4	OpenAI	95.5%	$8.75	9
5	o4-mini	OpenAI	88%	$2.75	26.8
6	o3-mini	OpenAI	85%	$2.75	25.4
7	Grok 4.3	xAI	85%	$1.88	38.6
8	GPT-5.4 mini	OpenAI	82%	$2.62	25.4
9	Grok 4.20	xAI	80%	$4	17.2
10	Gemini 3.5 Flash	Google	78%	$5.25	12.6
11	Grok 4	xAI	78%	$9	7.4
12	Gemini 2.5 Pro	Google	75%	$5.62	11.2
13	DeepSeek V4 Pro	DeepSeek	75%	$0.65	98.2
14	Qwen3-Max	Alibaba	72%	$3	20.6
15	Gemini 3.1 Flash	Google	70%	$1.4	43.1

Fastest Models — Inference Speed

Tokens per second on hosted inference. Speed matters for real-time applications and high-throughput pipelines.

#	Model	Provider	Tokens/sec	Blended $/M
1	Llama 4 Scout	Meta	2600 t/s	$0.23
2	Llama 4 Scout	Groq	2600 t/s	$0.56
3	Llama 3.3 70B	Meta	2500 t/s	$0.69
4	Llama 3.3 70B Versatile	Groq	2500 t/s	$0.79
5	Llama 3.1 8B	Meta	1800 t/s	$0.07
6	Llama 3.1 8B Instant	Groq	1800 t/s	$0.53
7	Gemini 3.1 Flash-Lite	Google	600 t/s	$0.88
8	GPT OSS 20B	Fireworks	564 t/s	$0.18
9	Gemini 3.1 Flash	Google	500 t/s	$1.4
10	Gemini 2.5 Flash-Lite	Google	450 t/s	$0.25
11	Gemini 3.5 Flash	Google	400 t/s	$5.25
12	Gemini 2.5 Flash	Google	350 t/s	$0.19
13	Grok 4.1 Fast	xAI	300 t/s	$0.35
14	Nova Micro	Amazon	300 t/s	$0.09
15	GPT OSS 120B	Fireworks	260 t/s	$0.38

Hardest Benchmark — Humanity's Last Exam

Expert-level questions across 100+ fields. No model scores above 65% — the frontier of AI capability.

#	Model	Provider	HLE Score	Blended $/M
1	Claude Mythos 5	Anthropic	64.5%	$15
2	Claude Opus 4.8	Anthropic	57.9%	$15
3	Gemini 3.1 Pro	Google	45.8%	$7
4	GPT-5.5	OpenAI	43.1%	$17.5
5	GPT-5.4	OpenAI	40.2%	$8.75
6	Grok 4.3	xAI	38%	$1.88
7	o4-mini	OpenAI	35%	$2.75
8	Grok 4.20	xAI	32%	$4
9	o3-mini	OpenAI	30%	$2.75
10	Grok 4	xAI	30%	$9

Methodology

Benchmark sources: Vellum LLM Leaderboard, Artificial Analysis, official model technical reports, and HuggingFace Open LLM Leaderboard. Scores are the latest publicly reported results as of June 2026.

Performance per dollar is calculated as avg_benchmark_score / blended_cost_per_mtok, where blended cost is the average of input and output pricing per million tokens. This metric rewards models that deliver strong performance at low cost — the core value proposition for production AI workloads.
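As a rough illustration of the formula above, here is a minimal Python sketch. The function names, the example scores, and the example prices are all hypothetical; the sketch also assumes the average is taken over whichever benchmark scores are available for a model, which the methodology does not specify.

```python
def blended_cost_per_mtok(input_price: float, output_price: float) -> float:
    """Average of input and output price, both in USD per million tokens."""
    return (input_price + output_price) / 2


def performance_per_dollar(
    benchmark_scores: list[float], input_price: float, output_price: float
) -> float:
    """Average benchmark score (percent) divided by blended cost per million tokens."""
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    blended = blended_cost_per_mtok(input_price, output_price)
    if blended <= 0:
        raise ValueError("blended cost must be positive")
    avg_score = sum(benchmark_scores) / len(benchmark_scores)
    return avg_score / blended


# Hypothetical model: made-up scores and prices, not taken from the tables.
scores = [60.0, 50.0, 40.0]  # average = 50.0
value = performance_per_dollar(scores, input_price=0.5, output_price=1.5)  # blended = 1.0
print(round(value, 1))  # 50.0
```

Because the blended cost weights input and output prices equally, workloads that are heavily skewed toward input or output tokens may see a different effective cost than this metric implies.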

Benchmark descriptions:

GPQA Diamond — PhD-level questions in biology, chemistry, and physics
AIME 2025 — American Invitational Mathematics Examination (competition math)
SWE-Bench — Resolving real GitHub issues in popular Python repositories
Humanity's Last Exam — Expert-level questions across 100+ academic fields
ARC-AGI 2 — Visual reasoning and abstract pattern completion
MMMLU — Multilingual version of MMLU (57 subjects, multiple languages)
HumanEval — Python code generation from natural language descriptions
MATH 500 — Mathematical problem solving across 5 difficulty levels
BFCL — Berkeley Function Calling Leaderboard (tool use & function calling)

Access this data programmatically via the benchmarks API endpoint; see the API documentation for details.