LOCAL MODEL ARENA
Which AI actually makes and reasons?
Local and frontier models are put through the same objective tests: the games they wrote are playable right here, their code is run against hidden tests, and their reasoning is checked against held-out answer keys. No LLM judges.
13 runs
8 models
6 local & free
4 axes
Leaderboard
Game-making 7 runs
1
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 10/10
fps>=50 10/10
controlled_win 20/20
input_decisive 15/15
losable 10/10
2
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 10/10
fps>=50 10/10
controlled_win 20/20
input_decisive 15/15
losable 10/10
3
loads_clean 20/20
boots_clean 15/15
canvas_non_blank 15/15
scenario_progress 25/25
win_reached 10/10
fps>=50 15/15
4
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 10/10
fps>=50 10/10
controlled_win 20/20
input_decisive 0/15
losable 10/10
5
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 10/10
fps>=50 10/10
controlled_win 20/20
input_decisive 0/15
losable 10/10
6
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 10/10
fps>=50 10/10
controlled_win 20/20
input_decisive 0/15
losable 0/10
7
loads_clean 15/15
boots_clean 10/10
contract_full 10/10
canvas_non_blank 0/10
fps>=50 0/10
controlled_win 0/20
input_decisive 0/15
losable 0/10
Monster battle (Pokémon-style) 2 runs
1
loads_clean 10/10
boots_clean 10/10
contract_full 20/20
canvas_non_blank 15/15
moves_work 25/25
two_sided 10/10
resolves 10/10
2
loads_clean 10/10
boots_clean 10/10
contract_full 20/20
canvas_non_blank 15/15
moves_work 0/25
two_sided 0/10
resolves 0/10
Illustration (SVG) 1 run
1
loads_clean 15/15
valid_svg 15/15
detail (shapes) 40/40
color_variety 30/30
Coding & reasoning 3 runs
1
coding (hidden tests) 50/50
reasoning (held-out keys) 50/50
2
coding (hidden tests) 50/50
reasoning (held-out keys) 38/50
3
coding (hidden tests) 50/50
reasoning (held-out keys) 25/50
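The per-check points listed above add up to 100 possible points in every run shown. The page does not say how a run's overall score is derived, so the following is only a minimal sketch that assumes the total is a plain sum of the check scores. The function name `total` and the dictionary layout are illustrative, not part of the arena's code; the numbers are copied from the Game-making rank 4 entry and the Coding & reasoning rank 2 entry.

```python
from typing import Dict, Tuple

# check name -> (points earned, points possible)
Checks = Dict[str, Tuple[int, int]]


def total(checks: Checks) -> Tuple[int, int]:
    """Sum earned and possible points over all checks (assumed scoring rule)."""
    earned = sum(e for e, _ in checks.values())
    possible = sum(p for _, p in checks.values())
    return earned, possible


# Values copied from the Game-making rank 4 entry.
game_rank_4: Checks = {
    "loads_clean": (15, 15),
    "boots_clean": (10, 10),
    "contract_full": (10, 10),
    "canvas_non_blank": (10, 10),
    "fps>=50": (10, 10),
    "controlled_win": (20, 20),
    "input_decisive": (0, 15),
    "losable": (10, 10),
}

# Values copied from the Coding & reasoning rank 2 entry.
coding_rank_2: Checks = {
    "coding (hidden tests)": (50, 50),
    "reasoning (held-out keys)": (38, 50),
}

if __name__ == "__main__":
    for name, checks in [("game rank 4", game_rank_4), ("coding rank 2", coding_rank_2)]:
        earned, possible = total(checks)
        print(f"{name}: {earned}/{possible}")
```

Under that assumption, a single failed check such as `input_decisive` costs exactly its listed weight and nothing else.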
How it works. Each run is content-addressed and append-only. Game code runs in an offline headless browser (no network access); coding tasks run against hidden unit tests; reasoning answers are matched against held-out answer keys. Scores measure correctness only; speed is recorded but not ranked, because local and cloud models run on different hardware.
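To make "content-addressed and append-only" concrete, here is a minimal sketch of one way such a run log could work. It assumes a run record is a JSON-serializable dictionary and that its identifier is the SHA-256 hash of a canonical JSON encoding. The names `run_id` and `RunLog`, the record fields, and the choice of hash are hypothetical; the page does not describe the arena's actual storage.

```python
import hashlib
import json
from typing import List, Set, Tuple


def canonical_json(record: dict) -> str:
    """Encode a record deterministically: sorted keys, no extra whitespace."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def run_id(record: dict) -> str:
    """Content address: the SHA-256 hex digest of the canonical encoding."""
    return hashlib.sha256(canonical_json(record).encode("utf-8")).hexdigest()


class RunLog:
    """In-memory append-only log: entries can be added but never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []  # (id, canonical JSON)
        self._ids: Set[str] = set()

    def append(self, record: dict) -> str:
        rid = run_id(record)
        if rid not in self._ids:  # identical content maps to the same id
            self._ids.add(rid)
            self._entries.append((rid, canonical_json(record)))
        return rid

    def entries(self) -> Tuple[Tuple[str, str], ...]:
        # Return an immutable snapshot so callers cannot rewrite history.
        return tuple(self._entries)


if __name__ == "__main__":
    log = RunLog()
    # Hypothetical record fields, for illustration only.
    rid = log.append({"axis": "coding", "score": 88, "provenance": "LOCAL"})
    print(rid[:12], len(log.entries()))
```

Because the identifier is derived from the content, changing a score after the fact would produce a different identifier rather than silently overwriting the original entry, which is the property that makes a published result auditable.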
Provenance. LOCAL = free model run on an Apple M5 Pro (24 GB). FRONTIER = paid cloud API, exact version pinned. WEB · MANUAL = output pasted from a chat UI (best-effort, single sample).