Methodology — Local Model Arena

01 The four axes

Models are tested only where the output is objectively checkable code, not prose. The four axes are Game-making, Monster battle, Illustration, and Coding & reasoning: Game-making and Monster battle each produce a playable HTML5 canvas app, Illustration hand-writes an SVG, and Coding & reasoning answers held-out tasks.

02 Rubrics & weights

Each axis has a fixed, versioned rubric of mechanical tiers, each worth a set number of points. For example, battle/v1 is: loads 10 · starts 10 · JS contract 20 · renders 15 · moves fire 25 · both fight 10 · ends cleanly 10, for a total of 100. The breakdown on each row shows exactly which tiers a model passed, and the hover title carries the raw key.
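As a rough illustration of how a tiered rubric like this could be tallied, here is a minimal sketch. The weights are the battle/v1 values quoted above; the tier keys (`js_contract`, `moves_fire`, and so on), the `scoreRubric` function, and the result shape are hypothetical names invented for this example, not the arena's actual scorer.

```javascript
// Tier weights for battle/v1, copied from the rubric quoted above.
// The tier keys are hypothetical identifiers chosen for this sketch.
const BATTLE_V1 = [
  ["loads", 10],
  ["starts", 10],
  ["js_contract", 20],
  ["renders", 15],
  ["moves_fire", 25],
  ["both_fight", 10],
  ["ends_cleanly", 10],
];

// Sum the weights of the tiers that passed and keep a per-tier breakdown.
function scoreRubric(rubric, passed) {
  const breakdown = rubric.map(([key, weight]) => ({
    key,
    weight,
    earned: passed.has(key) ? weight : 0,
  }));
  const total = breakdown.reduce((sum, tier) => sum + tier.earned, 0);
  return { total, breakdown };
}

// Illustrative input only: a build that loads, starts, and exposes the contract.
const result = scoreRubric(BATTLE_V1, new Set(["loads", "starts", "js_contract"]));
console.log(result.total); // 40
```

Keeping the breakdown next to the total is what lets a results row show which tiers passed rather than just a number.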

03 Anti-gaming: idle vs controlled

A game that wins by itself while untouched would fool a naive scorer, so every game is played twice: idle (no input) and controlled (a generic bot plays). Skill points are awarded only if the controlled run beats the idle run, and the game must be losable. A frozen or auto-win build fails these tiers.
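The idle-versus-controlled comparison can be pictured with a small sketch. Everything here is an assumption made for illustration: the `{ score, lost }` run shape, the `skillTiers` name, and in particular the rule that "losable" means at least one of the two runs ended in a loss. The real scorer may establish losability differently.

```javascript
// Hypothetical shape of one play-through result: { score: number, lost: boolean }.
// `idle` is the run with no input; `controlled` is the run driven by the generic bot.
function skillTiers(idle, controlled) {
  // The bot must do strictly better than doing nothing at all.
  const botBeatsIdle = controlled.score > idle.score;
  // Assumption: the game counts as losable if either run ended in a loss.
  const losable = idle.lost || controlled.lost;
  return { botBeatsIdle, losable, skillAwarded: botBeatsIdle && losable };
}

// A game where idling loses and the bot scores: skill is awarded.
console.log(skillTiers({ score: 0, lost: true }, { score: 12, lost: false }).skillAwarded); // true

// An auto-win build: same score either way and never lost, so skill is withheld.
console.log(skillTiers({ score: 50, lost: false }, { score: 50, lost: false }).skillAwarded); // false
```

A frozen build fails the same way as the auto-win one, because the controlled score never exceeds the idle score.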

04 Type-strict contract

Games expose a fixed window.__game API. The scorer checks types strictly: typeof score === 'number' must hold, so a model can’t stand in for a live value with a function, or with a getter that returns one. The bot drives play only through this contract.
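A minimal sketch of such a strict check follows. It takes the game object as a parameter so it can run outside a browser; in a page it would be called with `window.__game`. Only the `score` field comes from the text above. The `hasNumericScore` name is hypothetical, and the extra `Number.isFinite` guard (which rejects `NaN` and `Infinity`) is an assumption added for this example.

```javascript
// Strict check: the value read from `score` must itself be a number.
function hasNumericScore(game) {
  if (game === null || typeof game !== "object") return false;
  const score = game.score; // reading a getter-backed property yields its return value
  // Assumption: NaN and Infinity are also rejected, beyond the typeof test.
  return typeof score === "number" && Number.isFinite(score);
}

console.log(hasNumericScore({ score: 3 }));                          // true
console.log(hasNumericScore({ score: () => 3 }));                    // false: a function
console.log(hasNumericScore({ get score() { return () => 3; } }));   // false: getter returns a function
console.log(hasNumericScore({ get score() { return 7; } }));         // true: getter returns a number
console.log(hasNumericScore({ score: "3" }));                        // false: a string
```

Because the check inspects the value actually read, a getter passes only if it returns a number.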

05 K-sampling & variance

Game RNG makes a single play noisy, so each artifact is scored over multiple trials and we report the median plus σ, the standard deviation across those trials. A high σ means inconsistent quality, and it’s shown on every row.
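The two summary statistics can be computed as in the sketch below. The function names are hypothetical. It uses the population form of σ (dividing by the number of trials rather than by one less), which is an assumption, chosen because it reproduces the σ quoted in the worked example on this page.

```javascript
// Median: middle value of the sorted trials (mean of the two middle values if even).
function median(values) {
  if (values.length === 0) throw new Error("median of an empty list is undefined");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Population standard deviation: divides by N, not N - 1 (an assumption, see above).
function populationSigma(values) {
  if (values.length === 0) throw new Error("sigma of an empty list is undefined");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Summarize K trial scores as the pair reported on each row.
function summarize(trials) {
  return {
    median: median(trials),
    sigma: Number(populationSigma(trials).toFixed(2)),
  };
}
```

The median is less sensitive than the mean to a single unlucky trial, while σ still records that the unlucky trial happened.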

06 Offline isolation

Untrusted model code runs in headless Chromium with the network disabled, so it can’t phone home or exfiltrate data. On the published site, games are embedded in a sandbox under a strict default-src 'none' Content Security Policy.

07 Provenance tiers

FRONTIER · PAID = the metered cloud API with the exact version pinned (highest rigor). LOCAL · FREE = an open model run on a 24 GB Apple machine. WEB · MANUAL = output pasted from a chat UI: a single, trust-based sample, labelled distinctly because it’s lower-rigor.

08 Worked example

Claude Opus 4.8 on game/v2 across 3 trials scored {75, 100, 100} → median 100, σ 11.79. qwen2.5-coder:14b on battle/v1 built a valid contract but moves never fired and HP never changed → 55: the rubric caught a battle that looks right but doesn’t function.
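The arithmetic behind both figures can be checked by hand. The derivation below assumes σ is the population standard deviation (dividing by the 3 trials), which is the form that matches the quoted 11.79. The tier split for the score of 55 is an inference: it is the combination of battle/v1 weights most consistent with the description (loads, starts, JS contract, and renders passed; moves fire, both fight, and ends cleanly did not), not something the results row is quoted as stating.

```latex
\begin{align*}
\bar{x} &= \frac{75 + 100 + 100}{3} = \frac{275}{3} \approx 91.67 \\
\sigma &= \sqrt{\frac{(75 - \bar{x})^2 + 2\,(100 - \bar{x})^2}{3}}
        = \sqrt{\frac{1}{3}\left(\frac{2500}{9} + 2 \cdot \frac{625}{9}\right)}
        = \sqrt{\frac{1250}{9}} \approx 11.79 \\
\text{battle/v1 score} &= \underbrace{10}_{\text{loads}} + \underbrace{10}_{\text{starts}}
        + \underbrace{20}_{\text{JS contract}} + \underbrace{15}_{\text{renders}} = 55
\end{align*}
```

The sample form of σ (dividing by 2) would give about 14.43 for the same trials, so the quoted value distinguishes the two conventions.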