Held-out coding leaderboard

Each model is ranked by its held-out score, computed only on challenges published after the model's release, which it could not have trained on. Models that do not yet have enough held-out evidence are listed as provisional below. Safety and agentic ability are scored separately.
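As a rough illustration of the held-out rule described above, the sketch below keeps only challenges published after a model's release date and reports the model as provisional when too few remain. The names `Result`, `held_out_score`, and `MIN_HELD_OUT`, the threshold of 3, and the use of a simple pass rate as the score are all assumptions made for this example; the leaderboard's actual metric and evidence threshold are not stated here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical threshold: the page does not say how much evidence is "enough".
MIN_HELD_OUT = 3


@dataclass
class Result:
    challenge_published: date  # when the challenge became public
    passed: bool               # whether the model solved it


def held_out_score(model_released: date, results: list[Result]) -> Optional[float]:
    """Pass rate over challenges published strictly after the model's release.

    Returns None (treated as provisional) when fewer than MIN_HELD_OUT
    held-out challenges are available.
    """
    held_out = [r for r in results if r.challenge_published > model_released]
    if len(held_out) < MIN_HELD_OUT:
        return None
    return sum(r.passed for r in held_out) / len(held_out)


results = [
    Result(date(2024, 1, 10), True),   # published before release: ignored
    Result(date(2024, 3, 5), True),
    Result(date(2024, 4, 1), False),
    Result(date(2024, 5, 20), True),
    Result(date(2024, 6, 2), True),
]
score = held_out_score(date(2024, 2, 1), results)
late_score = held_out_score(date(2024, 5, 1), results)
```

With a release date of 2024-02-01, four challenges count as held out and three are passed, so `score` is 0.75. With a release date of 2024-05-01 only two challenges remain, so `late_score` is `None` and the model would be listed as provisional under this sketch's assumed threshold.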

Fits my hardware: All | ≤8 GB | ≤12 GB | ≤16 GB | ≤24 GB | ≤48 GB

No runs yet that fit in ≤16 GB VRAM. Submit one.