Compare 8 benchmarks across coding, knowledge, long context, multilingual, and more. Featuring results from Gemini 3 Deep Think, Gemini 3 Pro, GPT-5.1, and Claude Sonnet 4.5.
| Benchmark | Model | Score | Tools | Source |
|---|---|---|---|---|
| Humanity's Last Exam (accuracy) | Gemini 3 Deep Think | 41% | — | Link |
| Humanity's Last Exam (accuracy) | Gemini 3 Pro | 37.5% | — | Link |
| Humanity's Last Exam (accuracy) | GPT-5.1 | 26.5% | — | Link |
| GPQA Diamond (accuracy) | Gemini 3 Deep Think | 93.8% | — | Link |
| GPQA Diamond (accuracy) | Gemini 3 Pro | 91.9% | — | Link |
| GPQA Diamond (accuracy) | GPT-5.1 | 88.1% | — | Link |
| ARC-AGI-2 (accuracy) | Gemini 3 Deep Think | 45.1% | Yes | Link |
| ARC-AGI-2 (accuracy) | Gemini 3 Pro | 31.1% | — | Link |
| ARC-AGI-2 (accuracy) | GPT-5.1 | 17.6% | — | Link |
| MMMU-Pro (accuracy) | Gemini 3 Pro | 87.6% | — | Link |
| MMMU-Pro (accuracy) | GPT-5.1 | 83.6% | — | Link |
| MMMU-Pro (accuracy) | Claude Sonnet 4.5 | 77.8% | — | Link |
| LiveCodeBench Pro (Elo rating) | Gemini 3 Pro | 2,439 | — | Link |
| LiveCodeBench Pro (Elo rating) | GPT-5.1 | 1,775 | — | Link |
| LiveCodeBench Pro (Elo rating) | Claude Sonnet 4.5 | 1,418 | — | Link |
| FACTS Benchmark Suite (score) | Gemini 3 Pro | 70.5% | Yes | Link |
| FACTS Benchmark Suite (score) | GPT-5.1 | 63.4% | Yes | Link |
| FACTS Benchmark Suite (score) | Claude Sonnet 4.5 | 50.4% | Yes | Link |
| MMMLU (accuracy) | Gemini 3 Pro | 91.8% | — | Link |
| MMMLU (accuracy) | GPT-5.1 | 89.5% | — | Link |
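Unlike the percentage metrics, the LiveCodeBench Pro figures are Elo ratings, so the gaps are easiest to interpret as expected head-to-head scores. As a sketch (using the standard Elo logistic formula, which is an assumption about how this leaderboard's ratings should be read, not something stated by the benchmark itself):

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A (win probability, with draws counted
    as 0.5) against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# LiveCodeBench Pro ratings from the table above.
gemini_3_pro = 2439
gpt_5_1 = 1775

p = elo_expected_score(gemini_3_pro, gpt_5_1)
print(f"Expected head-to-head score: {p:.1%}")  # prints 97.9%
```

A 664-point Elo gap therefore corresponds to one side scoring roughly 98% of the points in direct comparison, which is why Elo differences above a few hundred points indicate a decisive rather than marginal lead.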