Leaderboard
Each model is ranked by its held-out score: its score on only those challenges published after the model's release, which it could not have trained on. Models without enough held-out evidence yet are listed as provisional below. Safety and agentic ability are scored separately.
| # | Model | Held-out | All-corpus | Math | Agentic | Planner | Safety | Calibration | Self-repair | Truncation | Solved | Efficiency | Best run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | qwen3-coder-next | 0.905 | 0.909 | 0.97 (32) | 0.42 (8) | — | 0.71 | 0.72 | 0.86 | 0.00 | 91/105 | 43 LOC · 79 MB · 0.7 s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne |
| 2 | devstral | 0.814 | 0.860 | 0.94 (32) | 0.49 (8) | — | 0.75 | 0.50 | 0.59 | 0.00 | 76/105 | 32 LOC · 123 MB · 0.5 s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne |
| 3 | qwen3-coder | 0.795 | 0.876 | 0.94 (32) | 0.48 (8) | — | 0.93 | 0.81 | 0.43 | 0.00 | 82/105 | 42 LOC · 79 MB · 0.4 s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne |
| 4 | qwen3.5-9b | 0.718 | 0.820 | 0.75 (32) | 0.38 (8) | — | 0.93 | — | 0.57 | ✂ 0.09 | 78/105 | 33 LOC · 74 MB · 0.4 s | UD-Q6_K_XL · 24 GB · runner verified · think off · @pscTheOne |
| 5 | glm-4.7-flash | 0.633 | 0.789 | 0.84 (32) | — | — | 0.71 | 0.76 | 0.16 | 0.00 | 74/105 | 40 LOC · 71 MB · 1.2 s | UD-Q4_K_XL · 24 GB · runner verified · think off · @pscTheOne |
| 6 | phi-4-mini | 0.373 | 0.586 | 0.69 (32) | 0.00 (8) | — | 0.64 | 0.65 | 0.18 | 0.00 | 51/105 | 27 LOC · 73 MB · 2.3 s | Q6_K · 24 GB · runner verified · @pscTheOne |
6 models with a run that fits ≤48 GB VRAM.
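
The held-out rule behind the ranking (count only challenges published after a model's release, and treat a model as provisional when too few qualify) can be illustrated with a minimal sketch. This is not the leaderboard's actual scoring code: the `Result` type, the `held_out_score` function, the plain mean as the aggregate, the strict date comparison, and the `min_evidence` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One model's result on one challenge (hypothetical record shape)."""
    challenge_id: str
    published: date  # date the challenge was published
    score: float     # per-challenge score in [0, 1]


def held_out_score(results, model_release, min_evidence=5):
    """Average the scores of challenges published strictly after the release.

    Returns (score, n_held_out). The score is None when fewer than
    `min_evidence` challenges qualify, which models the "provisional" case.
    Both the mean and the threshold value are assumptions, not the site's
    documented method.
    """
    held_out = [r.score for r in results if r.published > model_release]
    if len(held_out) < min_evidence:
        return None, len(held_out)
    return sum(held_out) / len(held_out), len(held_out)


# Illustrative data only; none of these values come from the leaderboard.
example_results = [
    Result("a", date(2025, 1, 10), 1.0),  # published before release: excluded
    Result("b", date(2025, 3, 1), 0.5),
    Result("c", date(2025, 4, 1), 1.0),
]
example_score, example_n = held_out_score(
    example_results, date(2025, 2, 1), min_evidence=2
)
```

With this made-up data, two challenges qualify as held-out and the first is excluded because it predates the release. Raising `min_evidence` above 2 would return `None` for the score, which is how a provisional listing could be represented.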