Each model ranked by its held-out score — only challenges published after the model's release, which it could not have trained on. Models without enough held-out evidence yet are listed as provisional below. Safety and agentic ability are scored separately.