Leaderboard

Each model is ranked by its held-out score, computed only on challenges published after the model's release, which it could not have trained on. Models that do not yet have enough held-out evidence are listed as provisional below. Safety and agentic ability are scored separately.
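The held-out rule can be illustrated with a minimal sketch. This is not the leaderboard's actual scoring code: the `ChallengeResult` type, its fields, the `min_clean` threshold, and the use of a plain mean are all assumptions made for illustration. The only part taken from the description above is the filter itself, which keeps challenges published after the model's release.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ChallengeResult:
    """Hypothetical record of one model's result on one challenge."""
    challenge_id: str
    published: date  # assumed field: publication date of the challenge
    score: float     # assumed field: per-challenge score in [0, 1]


def held_out_score(
    results: List[ChallengeResult],
    model_release: date,
    min_clean: int = 1,
) -> Optional[float]:
    """Mean score over challenges published strictly after the model's release.

    Returns None when fewer than `min_clean` challenges qualify, which is
    one way a "provisional" listing could be triggered (an assumption).
    """
    clean = [r for r in results if r.published > model_release]
    if len(clean) < min_clean:
        return None
    return sum(r.score for r in clean) / len(clean)


if __name__ == "__main__":
    results = [
        ChallengeResult("a", date(2025, 1, 1), 1.0),  # predates release, excluded
        ChallengeResult("b", date(2025, 6, 1), 0.5),
        ChallengeResult("c", date(2025, 7, 1), 1.0),
    ]
    print(held_out_score(results, model_release=date(2025, 3, 1)))
```

Under these assumptions, a challenge published on or before the release date never contributes to the held-out figure, while it would still count toward an all-corpus figure computed over every result.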



#	Model	Held-out	All-corpus	Math	Agentic	Planner	Safety	Calibration	Self-repair	Truncation	Solved	Efficiency	Best run
1	qwen3-coder-next	0.905 (45 clean · 100% dated)	0.909	0.97 (32)	0.42 (8)	—	0.71	0.72	0.86	0.00	91/105	43 LOC · 79 MB · 0.7 s	UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
2	devstral	0.814 (45 clean · 100% dated)	0.860	0.94 (32)	0.49 (8)	—	0.75	0.50	0.59	0.00	76/105	32 LOC · 123 MB · 0.5 s	UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
3	qwen3-coder	0.795 (45 clean · 100% dated)	0.876	0.94 (32)	0.48 (8)	—	0.93	0.81	0.43	0.00	82/105	42 LOC · 79 MB · 0.4 s	UD-Q4_K_XL · 24 GB · runner verified · @pscTheOne
4	qwen3.5-9b	0.718 (45 clean · 100% dated)	0.820	0.75 (32)	0.38 (8)	—	0.93	—	0.57	✂ 0.09	78/105	33 LOC · 74 MB · 0.4 s	UD-Q6_K_XL · 24 GB · runner verified · think off · @pscTheOne
5	glm-4.7-flash	0.633 (45 clean · 100% dated)	0.789	0.84 (32)	—	—	0.71	0.76	0.16	0.00	74/105	40 LOC · 71 MB · 1.2 s	UD-Q4_K_XL · 24 GB · runner verified · think off · @pscTheOne
6	phi-4-mini	0.373 (45 clean · 100% dated)	0.586	0.69 (32)	0.00 (8)	—	0.64	0.65	0.18	0.00	51/105	27 LOC · 73 MB · 2.3 s	Q6_K · 24 GB · runner verified · @pscTheOne

6 models with a run that fits ≤48 GB VRAM.