Leaderboard

Each model is ranked by its held-out score: its score on only those challenges published after the model's release, which it could not have trained on. Models that do not yet have enough held-out evidence are listed as provisional below. Safety and agentic ability are scored separately.
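As a rough illustration of the held-out idea, the sketch below keeps only results for challenges published after a model's release date and scores those. Everything here is an assumption for illustration: the `ChallengeResult` type, the `held_out_score` function, the `min_evidence` threshold for "provisional", and the use of a plain pass rate. The leaderboard's actual scoring formula is not stated on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChallengeResult:
    """Hypothetical record: one model's outcome on one challenge."""
    challenge_id: str
    published: date   # date the challenge was published
    passed: bool      # whether the model solved it


def held_out_score(results, model_release, min_evidence=1):
    """Return (score, n) over challenges published strictly after release.

    Assumption: the score is a plain pass rate. When fewer than
    `min_evidence` held-out challenges exist, the score is None,
    which a caller could treat as "provisional".
    """
    held_out = [r for r in results if r.published > model_release]
    if len(held_out) < min_evidence:
        return None, len(held_out)
    passed = sum(1 for r in held_out if r.passed)
    return passed / len(held_out), len(held_out)


# Made-up data: one challenge predates the release and is excluded.
results = [
    ChallengeResult("a", date(2025, 1, 1), True),
    ChallengeResult("b", date(2025, 6, 1), True),
    ChallengeResult("c", date(2025, 7, 1), False),
]
score, n = held_out_score(results, model_release=date(2025, 3, 1))
```

The strict `>` comparison is a deliberate choice in this sketch: a challenge published on the release date itself is treated as possibly seen and left out.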

# | Model | Held-out | All-corpus | Math | Agentic | Planner · Safety · Calibration · Self-repair · Truncation | Solved | Efficiency | Best run
1 | qwen3-coder-next | 0.905 (45 clean · 100% dated) | 0.909 | 0.97 (32) | 0.42 (8) | 0.71 0.72 0.86 0.00 | 91/105 | 43 LOC · 79 MB · 0.7s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
2 | devstral | 0.814 (45 clean · 100% dated) | 0.860 | 0.94 (32) | 0.49 (8) | 0.75 0.50 0.59 0.00 | 76/105 | 32 LOC · 123 MB · 0.5s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
3 | qwen3-coder | 0.795 (45 clean · 100% dated) | 0.876 | 0.94 (32) | 0.48 (8) | 0.93 0.81 0.43 0.00 | 82/105 | 42 LOC · 79 MB · 0.4s | UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
4 | qwen3.5-9b | 0.718 (45 clean · 100% dated) | 0.820 | 0.75 (32) | 0.38 (8) | 0.93 0.57 0.09 | 78/105 | 33 LOC · 74 MB · 0.4s | UD-Q6_K_XL · 24 GB · runner verified · think off · @pscTheOne
5 | glm-4.7-flash | 0.633 (45 clean · 100% dated) | 0.789 | 0.84 (32) | | 0.71 0.76 0.16 0.00 | 74/105 | 40 LOC · 71 MB · 1.2s | UD-Q4_K_XL · 24 GB · runner verified · think off · @pscTheOne
6 | phi-4-mini | 0.373 (45 clean · 100% dated) | 0.586 | 0.69 (32) | 0.00 (8) | 0.64 0.65 0.18 0.00 | 51/105 | 27 LOC · 73 MB · 2.3s | Q6_K · 24 GB · runner verified · @pscTheOne

6 models with a run that fits ≤48 GB VRAM.