Methodology
How we score models, why the test set rotates, and the audit trail behind every published number.
§01 How a model gets scored
Each submission travels through a deterministic six-step pipeline. The runner is hermetic: it pulls the artefact, runs it against a fixed test split, and posts the results back over a signed webhook.
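For a concrete picture, here is a minimal sketch of what the webhook receiver's signature check might look like. The header name, the HMAC-SHA256 scheme, and the shared-secret setup are assumptions; the actual signing scheme is defined by the runner, not shown here.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify that a result payload really came from the hermetic runner.
// Assumes (hypothetically) that the runner signs the raw request body with
// HMAC-SHA256 under a shared secret and sends the hex digest in a header.
export function verifyRunnerSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Comparing digests with a constant-time check avoids leaking signature bytes through response timing.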
§02 Contamination resistance
The private test set rotates every quarter: 50 of the 200 airfoils are replaced with fresh procedural perturbations of their parent geometries, generated under a fixed seed schedule. Submissions cannot have pre-trained on the rotated portion because those airfoils did not exist when the model was trained.
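As an illustration of a fixed seed schedule, the rotation set can be derived deterministically from the quarter label, so the same quarter always yields the same 50 indices. Everything below (the `mulberry32` PRNG, the seed derivation, the helper names) is a hypothetical sketch, not the production schedule.

```ts
// Small deterministic PRNG so the rotation is reproducible from a seed.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Choose which 50 of the 200 test airfoils rotate out in a given quarter.
function rotationSet(quarter: string, total = 200, rotate = 50): number[] {
  // Derive a numeric seed from the quarter label, e.g. "2026Q3".
  let seed = 0;
  for (const ch of quarter) seed = (seed * 31 + ch.charCodeAt(0)) | 0;
  const rand = mulberry32(seed);
  // Fisher-Yates shuffle of all indices, then take the first `rotate`.
  const idx = Array.from({ length: total }, (_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, rotate).sort((a, b) => a - b);
}
```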
The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
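For illustration, one way the cross-quarter report could be computed, assuming per-airfoil scores keyed by airfoil ID; the function shape, `stableIds`, and `baseline` are hypothetical:

```ts
// Split a replay's per-airfoil scores into the stable and rotated cohorts.
// `scores` maps airfoil ID -> metric value for this run; `baseline` holds the
// original run's scores for the stable cohort (both shapes are assumptions).
function crossQuarterReport(
  scores: Map<string, number>,
  baseline: Map<string, number>,
  stableIds: Set<string>,
) {
  let persistDelta = 0, persistN = 0;
  let rotatedSum = 0, rotatedN = 0;
  for (const [id, value] of scores) {
    if (stableIds.has(id)) {
      // Persistence: how much the stable-cohort score moved between runs.
      persistDelta += Math.abs(value - (baseline.get(id) ?? value));
      persistN++;
    } else {
      // Drift cohort: performance on airfoils the model has never seen.
      rotatedSum += value;
      rotatedN++;
    }
  }
  return {
    persistence: persistDelta / Math.max(persistN, 1),
    rotatedMean: rotatedSum / Math.max(rotatedN, 1),
  };
}
```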
§03 The composite score, derived from first principles
The composite is a weighted sum of five normalised metrics. Each weight is chosen so that a real, useful improvement on one metric carries the same scoring impact as a comparable improvement on any of the others; see the annotations.
Lower composite is better. The leaderboard's default sort is by composite ascending.
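For concreteness, the composite has the shape sketched below. Only the lower-is-better weighted-sum structure is taken from the text; the metric names and weights are placeholders (the real values come from the annotated weights on this page).

```ts
// Composite = sum of weight_i * normalised(metric_i); lower is better.
// Metric names and weights here are illustrative placeholders.
const WEIGHTS: Record<string, number> = {
  metricA: 0.3,
  metricB: 0.25,
  metricC: 0.2,
  metricD: 0.15,
  metricE: 0.1,
};

function composite(normalised: Record<string, number>): number {
  return Object.entries(WEIGHTS).reduce(
    (sum, [name, w]) => sum + w * normalised[name],
    0,
  );
}

interface Entry { id: string; metrics: Record<string, number>; }

const entries: Entry[] = [
  { id: "model-a", metrics: { metricA: 0.2, metricB: 0.4, metricC: 0.1, metricD: 0.3, metricE: 0.5 } },
  { id: "model-b", metrics: { metricA: 0.3, metricB: 0.2, metricC: 0.2, metricD: 0.2, metricE: 0.4 } },
];

// Default leaderboard order: composite ascending (best model first).
entries.sort((a, b) => composite(a.metrics) - composite(b.metrics));
```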
§04 What we archive
Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected, we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.
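A sketch of the sealing step as it might look in a Cloudflare Worker with an R2 binding. The binding name, key layout, and metadata fields are assumptions; the bundle's contents are those listed above.

```ts
interface Env { ARCHIVE: R2Bucket; } // binding name is an assumption

// Seal a submission bundle in R2. The key encodes submission ID and quarter
// so a later audit can replay the exact artefact against a fresh test set.
async function sealSubmission(
  env: Env,
  submissionId: string,
  quarter: string,
  bundle: ArrayBuffer,
  sha256Hex: string,
): Promise<void> {
  await env.ARCHIVE.put(`submissions/${quarter}/${submissionId}.tar.zst`, bundle, {
    customMetadata: {
      sha256: sha256Hex, // integrity check at replay time
      sealedAt: new Date().toISOString(),
      // 24-month retention window, expressed as an explicit timestamp.
      retainUntil: new Date(Date.now() + 730 * 86400_000).toISOString(),
    },
  });
}
```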
§05 Audit and replay
If a submission is suspected of having seen the test set during training, we can replay it against a freshly rotated quarter without the submitter touching anything. The difference between the original and replayed scores is the audit signal.
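A minimal sketch of that check, assuming both composites are available; the tolerance value is a placeholder, not from the spec.

```ts
// Replay audit: a model that trained on the old test set should score
// noticeably worse on a freshly rotated quarter. Since lower composite is
// better, a large positive delta (replay minus original) is the audit signal.
function auditSignal(
  originalComposite: number,
  replayComposite: number,
  tolerance = 0.05, // placeholder threshold
): { delta: number; flagged: boolean } {
  const delta = replayComposite - originalComposite;
  return { delta, flagged: delta > tolerance };
}
```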
Want the full spec?
The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.