LOCAL MODEL ARENA

How scores are earned, not judged

No model ever grades another. Every score comes from running the model’s own output against mechanical checks in an isolated sandbox. Here is exactly how.

01 The four axes

Models are tested only on tasks whose output is objectively checkable code, not prose. There are four axes: Game-making and Monster battle each produce a playable HTML5 canvas app, Illustration produces a hand-written SVG, and Coding & reasoning answers held-out tasks.

02 Rubrics & weights

Each axis has a fixed, versioned rubric of mechanical tiers, each worth a set number of points. For example, battle/v1 is: loads 10 · starts 10 · JS contract 20 · renders 15 · moves fire 25 · both fight 10 · ends cleanly 10 (100 in total). The breakdown on each row shows exactly which tiers a model passed; its title on hover carries the raw key.
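
As a rough illustration of how a tiered rubric turns into a score, the sketch below sums the weights of the tiers that passed. The weights are the battle/v1 numbers quoted above; the key names (loads, jsContract, and so on) and the scoreRubric helper are hypothetical, and the arena's real raw keys and scorer may look different.

```javascript
// Tier weights for battle/v1, copied from the rubric text.
// Key names are hypothetical; the arena's raw keys may differ.
const BATTLE_V1 = {
  loads: 10,
  starts: 10,
  jsContract: 20,
  renders: 15,
  movesFire: 25,
  bothFight: 10,
  endsCleanly: 10,
};

// Sum the weights of the tiers that passed.
// Duplicates are counted once; unknown tier names are rejected.
function scoreRubric(rubric, passedTiers) {
  let total = 0;
  for (const tier of new Set(passedTiers)) {
    if (!Object.prototype.hasOwnProperty.call(rubric, tier)) {
      throw new Error(`unknown tier: ${tier}`);
    }
    total += rubric[tier];
  }
  return total;
}

// Passing every tier gives the maximum score.
const maxScore = scoreRubric(BATTLE_V1, Object.keys(BATTLE_V1));

// Assumed example: a build that loads, starts, exposes a valid contract
// and renders, but fails every later tier.
const partialScore = scoreRubric(BATTLE_V1, [
  "loads",
  "starts",
  "jsContract",
  "renders",
]);

console.log(maxScore, partialScore);
```

Because each tier is a pass/fail check with a fixed weight, the per-row breakdown and the total are two views of the same data.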

03 Anti-gaming: idle vs controlled

A game that wins by itself while untouched would fool a naive scorer, so every game is played twice: idle (no input) and controlled (a generic bot plays). Skill points are awarded only if the controlled run beats the idle run, and the game must be losable. A frozen or auto-win build fails these tiers.
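
The idle-versus-controlled comparison can be sketched as a small gate. Everything here is an assumption made for illustration: the run shape { score, lost }, the skillTiers helper, and in particular the reading of "losable" as "at least one of the two runs ended in a loss". The arena may establish losability differently.

```javascript
// Hypothetical result of one playthrough: { score: number, lost: boolean }.
// Returns the two anti-gaming checks and whether skill points are awarded.
function skillTiers(idleRun, controlledRun) {
  // The bot must do strictly better than leaving the game untouched.
  const botBeatsIdle = controlledRun.score > idleRun.score;
  // Assumption: the game counts as losable if either run ended in a loss.
  const losable = idleRun.lost || controlledRun.lost;
  return { botBeatsIdle, losable, skillAwarded: botBeatsIdle && losable };
}

// Auto-win build: full marks with no input, so the bot cannot beat idle.
const autoWin = skillTiers(
  { score: 100, lost: false },
  { score: 100, lost: false }
);

// Frozen build: nothing ever changes in either run.
const frozen = skillTiers({ score: 0, lost: false }, { score: 0, lost: false });

// Healthy build: idle play loses early, the bot scores well.
const healthy = skillTiers({ score: 3, lost: true }, { score: 42, lost: false });

console.log(autoWin.skillAwarded, frozen.skillAwarded, healthy.skillAwarded);
```

The comparison is strict on purpose: a tie between idle and controlled play is exactly what a frozen or auto-win build produces.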

04 Type-strict contract

Games expose a fixed window.__game API. The scorer checks types strictly: typeof score must be 'number', not 'function', so a model can’t fake a live value with a getter that returns a function. The bot drives play only through this contract.
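
A minimal sketch of a strict typeof check follows. Only the score field comes from the text above; the over field, the SCHEMA object, and the checkContract helper are hypothetical stand-ins, and the objects are passed in directly rather than read from window.__game so the snippet runs outside a browser.

```javascript
// Hypothetical schema: `score` is named in the text, `over` is illustrative.
const SCHEMA = { score: "number", over: "boolean" };

// Returns a list of human-readable failures; an empty list means the
// object satisfies the schema under a strict typeof comparison.
function checkContract(game, schema = SCHEMA) {
  if (game === null || typeof game !== "object") {
    return ["contract is not an object"];
  }
  const failures = [];
  for (const [key, expected] of Object.entries(schema)) {
    const actual = typeof game[key]; // reading the key invokes any getter
    if (actual !== expected) {
      failures.push(`${key}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}

// Honest build: a getter exposes a live primitive value.
let points = 0;
const honest = {
  get score() {
    return points;
  },
  over: false,
};

// Faked build: the getter hands back a function instead of a number.
const faked = {
  get score() {
    return () => 999;
  },
  over: false,
};

const honestFailures = checkContract(honest);
const fakedFailures = checkContract(faked);
console.log(honestFailures, fakedFailures);
```

Note that typeof NaN is also 'number', so a stricter variant of this sketch might add a Number.isFinite check on numeric fields.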

05 K-sampling & variance

Game RNG makes a single play noisy, so each artifact is scored over K trials and we report the median plus σ, the standard deviation of the scores across trials. A high σ means inconsistent quality; σ is shown on every row.
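
The two summary numbers can be computed as in the sketch below. The helper names are hypothetical. The population form of σ (dividing by the number of trials rather than by one less) is an assumption, chosen because it reproduces the σ quoted in the worked example for the trials {75, 100, 100}.

```javascript
// Median of a non-empty list of trial scores.
function median(values) {
  if (values.length === 0) throw new Error("need at least one trial");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, copy first
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Population standard deviation (divides by n, not n - 1).
// Assumption: this is the form behind the σ shown on each row.
function populationSigma(values) {
  if (values.length === 0) throw new Error("need at least one trial");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

const trials = [75, 100, 100];
const summary = {
  median: median(trials),
  sigma: Number(populationSigma(trials).toFixed(2)),
};
console.log(summary); // { median: 100, sigma: 11.79 }
```

The median ignores the single 75 entirely, which is why σ is reported next to it: the spread is the only place that outlier shows up.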

06 Offline isolation

Untrusted model code runs in headless Chromium with the network disabled, so it can’t phone home or exfiltrate data. On the published site, games are embedded in a sandbox with a strict default-src 'none' Content Security Policy.

07 Provenance tiers

FRONTIER · PAID = the metered cloud API with the exact version pinned (highest rigor). LOCAL · FREE = an open model run on a 24 GB Apple machine. WEB · MANUAL = output pasted from a chat UI: a single, trust-based sample, labelled distinctly because it is lower rigor.

08 Worked example

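Claude Opus 4.8 on game/v2 across 3 trials scored {75, 100, 100} → median 100, σ 11.79 (the population standard deviation of the three scores). qwen2.5-coder:14b on battle/v1 built a valid contract, but moves never fired and HP never changed → 55, which is consistent with passing only loads, starts, JS contract and renders (10 + 10 + 20 + 15). The rubric caught a battle that looks right but doesn’t function.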