Capability evolution

Each model's held-out code score, plotted against its release date. A model is scored only on challenges published after it shipped, which it could not have trained on. This is the honest frontier: contaminated (pre-release) challenges are excluded, so the line reflects generalization, not memorization. Hover over a point to see its held-out sample size.
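
The held-out filter described above can be sketched roughly as follows. This is a minimal illustration, not the chart's actual pipeline: the `Challenge` and `Result` types, the `held_out_score` function, and the use of a simple pass rate as the "code score" are all assumptions made for the example. It also assumes a challenge published on the release date itself counts as contaminated (strict "after").

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Challenge:
    # Hypothetical record: one benchmark challenge and its publication date.
    challenge_id: str
    published: date


@dataclass(frozen=True)
class Result:
    # Hypothetical record: whether a model solved a given challenge.
    challenge_id: str
    passed: bool


def held_out_score(
    release_date: date,
    challenges: list[Challenge],
    results: list[Result],
) -> tuple[Optional[float], int]:
    """Return (pass_rate, sample_size) using only challenges published
    strictly after the model's release date.

    Assumption: the score is a plain pass rate. Returns (None, 0) when the
    model has no held-out challenges, so it can be left off the chart.
    """
    held_out_ids = {
        c.challenge_id for c in challenges if c.published > release_date
    }
    held_out = [r for r in results if r.challenge_id in held_out_ids]
    if not held_out:
        return None, 0
    passed = sum(1 for r in held_out if r.passed)
    return passed / len(held_out), len(held_out)


# Example: a model released on 2024-03-01 is scored on "b" and "c" only;
# "a" was published before release and is excluded as contaminated.
challenges = [
    Challenge("a", date(2024, 1, 10)),
    Challenge("b", date(2024, 6, 1)),
    Challenge("c", date(2024, 9, 15)),
]
results = [Result("a", True), Result("b", True), Result("c", False)]
score, sample_size = held_out_score(date(2024, 3, 1), challenges, results)
print(score, sample_size)  # 0.5 2
```

The second return value corresponds to the held-out sample size shown on hover. A newer model has fewer post-release challenges to be scored on, so its point rests on a smaller sample and should be read with more caution.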