Methodology
How we score models, why the test set rotates, and the audit trail behind every published number.
§01 How a model gets scored
Each submission travels through a deterministic six-step pipeline. The runner is hermetic: it pulls the artefact, runs it against a fixed test split, and posts the results back over a signed webhook.
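For a concrete picture, here is a minimal sketch of what the webhook receiver's signature check might look like. The header name, the HMAC-SHA256 scheme, and the shared-secret setup are assumptions; the actual signing scheme is defined by the runner, not shown here.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify that a result payload really came from the hermetic runner.
// Assumes (hypothetically) that the runner signs the raw request body with
// HMAC-SHA256 under a shared secret and sends the hex digest in a header.
export function verifyRunnerSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Comparing digests with a constant-time check avoids leaking signature bytes through response timing.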
§02 Contamination resistance
The private test set rotates every quarter: 50 of the 200 airfoils are replaced with fresh procedural perturbations of their parent geometries, generated under a fixed seed schedule. Submissions cannot have pre-trained on the rotated portion because those airfoils did not exist when the model was trained.
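As an illustration of a fixed seed schedule, the rotation set can be derived deterministically from the quarter label, so the same quarter always yields the same 50 indices. Everything below (the `mulberry32` PRNG, the seed derivation, the helper names) is a hypothetical sketch, not the production schedule.

```ts
// Small deterministic PRNG so the rotation is reproducible from a seed.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Choose which 50 of the 200 test airfoils rotate out in a given quarter.
function rotationSet(quarter: string, total = 200, rotate = 50): number[] {
  // Derive a numeric seed from the quarter label, e.g. "2026Q3".
  let seed = 0;
  for (const ch of quarter) seed = (seed * 31 + ch.charCodeAt(0)) | 0;
  const rand = mulberry32(seed);
  // Fisher-Yates shuffle of all indices, then take the first `rotate`.
  const idx = Array.from({ length: total }, (_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, rotate).sort((a, b) => a - b);
}
```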
The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
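For illustration, one way the cross-quarter report could be computed, assuming per-airfoil scores keyed by airfoil ID; the function shape, `stableIds`, and `baseline` are hypothetical:

```ts
// Split a replay's per-airfoil scores into the stable and rotated cohorts.
// `scores` maps airfoil ID -> metric value for this run; `baseline` holds the
// original run's scores for the stable cohort (both shapes are assumptions).
function crossQuarterReport(
  scores: Map<string, number>,
  baseline: Map<string, number>,
  stableIds: Set<string>,
) {
  let persistDelta = 0, persistN = 0;
  let rotatedSum = 0, rotatedN = 0;
  for (const [id, value] of scores) {
    if (stableIds.has(id)) {
      // Persistence: how much the stable-cohort score moved between runs.
      persistDelta += Math.abs(value - (baseline.get(id) ?? value));
      persistN++;
    } else {
      // Drift cohort: performance on airfoils the model has never seen.
      rotatedSum += value;
      rotatedN++;
    }
  }
  return {
    persistence: persistDelta / Math.max(persistN, 1),
    rotatedMean: rotatedSum / Math.max(rotatedN, 1),
  };
}
```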
§03 The composite score, derived from first principles
The composite is a weighted sum of five normalised metrics. Each weight is chosen so that a real, useful improvement on one metric carries the same scoring impact as a comparable improvement on any of the others; see the annotations.
Lower composite is better. The leaderboard's default sort is by composite ascending.
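For concreteness, the composite has the shape sketched below. Only the lower-is-better weighted-sum structure is taken from the text; the metric names and weights are placeholders (the real values come from the annotated weights on this page).

```ts
// Composite = sum of weight_i * normalised(metric_i); lower is better.
// Metric names and weights here are illustrative placeholders.
const WEIGHTS: Record<string, number> = {
  metricA: 0.3,
  metricB: 0.25,
  metricC: 0.2,
  metricD: 0.15,
  metricE: 0.1,
};

function composite(normalised: Record<string, number>): number {
  return Object.entries(WEIGHTS).reduce(
    (sum, [name, w]) => sum + w * normalised[name],
    0,
  );
}

interface Entry { id: string; metrics: Record<string, number>; }

const entries: Entry[] = [
  { id: "model-a", metrics: { metricA: 0.2, metricB: 0.4, metricC: 0.1, metricD: 0.3, metricE: 0.5 } },
  { id: "model-b", metrics: { metricA: 0.3, metricB: 0.2, metricC: 0.2, metricD: 0.2, metricE: 0.4 } },
];

// Default leaderboard order: composite ascending (best model first).
entries.sort((a, b) => composite(a.metrics) - composite(b.metrics));
```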
§04 What we archive
Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected, we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.
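A sketch of the sealing step as it might look in a Cloudflare Worker with an R2 binding. The binding name, key layout, and metadata fields are assumptions; the bundle's contents are those listed above.

```ts
interface Env { ARCHIVE: R2Bucket; } // binding name is an assumption

// Seal a submission bundle in R2. The key encodes submission ID and quarter
// so a later audit can replay the exact artefact against a fresh test set.
async function sealSubmission(
  env: Env,
  submissionId: string,
  quarter: string,
  bundle: ArrayBuffer,
  sha256Hex: string,
): Promise<void> {
  await env.ARCHIVE.put(`submissions/${quarter}/${submissionId}.tar.zst`, bundle, {
    customMetadata: {
      sha256: sha256Hex, // integrity check at replay time
      sealedAt: new Date().toISOString(),
      // 24-month retention window, expressed as an explicit timestamp.
      retainUntil: new Date(Date.now() + 730 * 86400_000).toISOString(),
    },
  });
}
```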
§05 Audit and replay
If a submission is suspected of having seen the test set during training, we can replay it against a freshly rotated quarter without the submitter touching anything. The difference between the original and replayed scores is the audit signal.
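A minimal sketch of that check, assuming both composites are available; the tolerance value is a placeholder, not from the spec.

```ts
// Replay audit: a model that trained on the old test set should score
// noticeably worse on a freshly rotated quarter. Since lower composite is
// better, a large positive delta (replay minus original) is the audit signal.
function auditSignal(
  originalComposite: number,
  replayComposite: number,
  tolerance = 0.05, // placeholder threshold
): { delta: number; flagged: boolean } {
  const delta = replayComposite - originalComposite;
  return { delta, flagged: delta > tolerance };
}
```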
Want the full spec?
The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.