Held-out coding leaderboard
Each model is ranked by its held-out score, which counts only challenges published after the model's release and which it therefore could not have trained on. Models without enough held-out evidence yet are listed as provisional below. Safety and agentic ability are scored separately.
| # | Model | Held-out | All-corpus | Math | Agentic | Planner | Safety | Calibration | Self-repair | Truncation | Solved | Efficiency | Best run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | qwen3-coder-next | 0.891 | 0.730 | 0.68 (22) | — | — | 0.71 | 0.69 | 0.56 | ✂ 0.25 | 111/159 | 36 LOC · 69 MB · 0.3 s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne |
| 2 | qwen3-coder | 0.869 | 0.720 | 0.36 (22) | — | — | 0.93 | 0.77 | 0.30 | ✂ 0.10 | 102/159 | 41 LOC · 67 MB · 0.3 s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne |
| 3 | phi-4-mini | 0.319 | 0.444 | 0.04 (22) | — | — | 0.64 | 0.68 | 0.07 | ✂ 0.03 | 57/159 | 26 LOC · 62 MB · 1.7 s | Q6_K · 24 GB · runner verified · @pscTheOne |
3 models with a run that fits ≤24 GB VRAM.
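
The held-out rule described at the top of this page can be illustrated with a minimal sketch. It assumes the score is a plain pass fraction over challenges published strictly after the model's release; the leaderboard's actual formula is not stated here and may weight challenges differently. The names `Challenge`, `held_out_score`, and the `min_evidence` threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Challenge:
    """One graded challenge result (hypothetical record shape)."""
    name: str
    published: date
    solved: bool


def held_out_score(
    release: date,
    results: List[Challenge],
    min_evidence: int = 3,  # assumed threshold, not the leaderboard's real one
) -> Optional[float]:
    """Pass fraction over challenges published strictly after `release`.

    Returns None when fewer than `min_evidence` held-out challenges exist,
    which corresponds to a "provisional" listing.
    """
    held_out = [c for c in results if c.published > release]
    if len(held_out) < min_evidence:
        return None
    return sum(c.solved for c in held_out) / len(held_out)


# Made-up example data, not taken from the leaderboard.
results = [
    Challenge("a", date(2025, 1, 10), True),   # predates release: excluded
    Challenge("b", date(2025, 3, 1), True),
    Challenge("c", date(2025, 3, 15), False),
    Challenge("d", date(2025, 4, 2), True),
    Challenge("e", date(2025, 4, 20), True),
]

ranked = held_out_score(date(2025, 2, 1), results)       # 3 of 4 held-out solved
provisional = held_out_score(date(2025, 4, 1), results)  # only 2 held-out: None
```

Under these assumptions, a model released on 2025-02-01 is scored on four challenges, while one released on 2025-04-01 has only two held-out challenges and would stay provisional until more are published.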