LOCAL MODEL ARENA
Live · 16 runs

Which AI actually makes and reasons?

Local and frontier models, put through the same objective tests: the games they write are playable right here, their code runs against hidden unit tests, and their reasoning is checked against held-out keys. No LLM judges.

Three frontier-class models, entered as four configurations, face the same objective tests. On game-making (game/v2), this is where they stand.

100 best
75 lowest (−25 vs. best)
100 top score
4 models
16 runs
4 axes
How scores work
Every run is content-addressed and append-only. Game/battle code runs in an offline headless browser (network disabled) and is played by a generic bot; coding runs against hidden unit tests; reasoning is matched to held-out keys. Scores are correctness only — speed is recorded but never ranked (local and cloud run on different hardware). No model ever judges another.
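
As an illustration of what "content-addressed and append-only" can mean in practice, here is a minimal Python sketch. It is not the arena's actual storage code: the `run_id` function, the `RunLog` class, and the record fields are hypothetical names chosen for this example, and the use of SHA-256 over canonical JSON is an assumption.

```python
import hashlib
import json


def run_id(record: dict) -> str:
    """Content address (assumed scheme): SHA-256 of the record's canonical JSON."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunLog:
    """Hypothetical append-only store: records can be added, never edited or removed."""

    def __init__(self) -> None:
        self._order: list[str] = []          # insertion order of run ids
        self._records: dict[str, dict] = {}  # run id -> private copy of the record

    def append(self, record: dict) -> str:
        rid = run_id(record)
        if rid not in self._records:
            # Identical content hashes to the same id, so it is stored only once.
            self._records[rid] = json.loads(json.dumps(record))
            self._order.append(rid)
        return rid

    def get(self, rid: str) -> dict:
        # Return a copy so callers cannot mutate the stored record.
        return json.loads(json.dumps(self._records[rid]))

    def __len__(self) -> int:
        return len(self._order)
```

Because the id is derived from the content, changing any field of a run (for example its score) produces a different id rather than overwriting the earlier record.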

Game-making

Writes a playable browser game — we test it loads, renders, holds ≥50 fps, is winnable & losable.

game/v2 · 4 runs
  1. claude-opus-4.8 (web)
    100/100 σ 11.79 · n=3
    WEB · MANUAL game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 15/15
    losable 10/10
  2. grok-4.3
    85/100 σ 11.22 · n=5
    WEB · MANUAL game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 10/10
  3. glm-5.2
    85/100 σ 11.14 · n=5
    WEB · MANUAL game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 10/10
  4. claude-opus-4.8 (Code · ULTRA)
    75/100 σ 17.15 · n=5
    WEB · MANUAL game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 0/10
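
The headline score of each entry above matches the sum of its per-check points. A small Python sketch of that arithmetic follows; the maximum points per check are read off the breakdowns above, while the `GAME_V2_MAX` name and the `total_score` helper are hypothetical and only illustrate the sum.

```python
# Maximum points per check, as shown in the game/v2 breakdowns (they sum to 100).
GAME_V2_MAX = {
    "loads": 15,
    "starts": 10,
    "JS contract": 10,
    "renders": 10,
    "≥50 fps": 10,
    "winnable": 20,
    "input works": 15,
    "losable": 10,
}


def total_score(earned: dict[str, int], maximum: dict[str, int]) -> int:
    """Sum the earned points, rejecting unknown checks or out-of-range values."""
    for name, points in earned.items():
        if name not in maximum:
            raise KeyError(f"unknown check: {name}")
        if not 0 <= points <= maximum[name]:
            raise ValueError(f"{name}: {points} is outside 0..{maximum[name]}")
    # A check missing from `earned` counts as 0 points.
    return sum(earned.get(name, 0) for name in maximum)


# Breakdown shown for grok-4.3: everything passes except "input works".
grok = {**GAME_V2_MAX, "input works": 0}
print(total_score(grok, GAME_V2_MAX))  # 85
```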

Monster battle

Pokémon-style turn duel — we test the move API works, both sides act, and it resolves to a winner.

battle/v1 · 4 runs
  1. glm-5.2
    100/100 σ 0.0 · n=3
    WEB · MANUAL battle/v1
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 10/10
    visual density 16/16
  2. claude-opus-4.8 (Code · ULTRA)
    82/100 σ 0.0 · n=3
    WEB · MANUAL battle/v1
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 0/10
    visual density 8/16
  3. claude-opus-4.8 (web)
    82/100 σ 0.0 · n=3
    WEB · MANUAL battle/v1
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 0/10
    visual density 8/16
  4. grok-4.3
    72/100 σ 0.0 · n=3
    WEB · MANUAL battle/v1
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 0/6
    power variety 0/8
    utility move 0/10
    visual density 12/16
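
The behavioral checks in this suite (moves fire, both sides fight, the duel ends with a winner) can be pictured with a small Python sketch. This is not the arena's harness, which per the scoring notes plays the submitted code in an offline headless browser. The `Monster` and `Duel` classes, the `use_move` method, and the `check_duel` function are all hypothetical stand-ins, and the bot here simply picks each side's first move.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Monster:
    name: str
    hp: int
    moves: dict[str, int]  # move name -> damage dealt (hypothetical move model)


@dataclass
class Duel:
    a: Monster
    b: Monster
    acted: set = field(default_factory=set)  # names of monsters that have used a move

    def use_move(self, attacker: Monster, defender: Monster, move: str) -> None:
        defender.hp = max(0, defender.hp - attacker.moves[move])
        self.acted.add(attacker.name)

    def winner(self) -> Optional[str]:
        if self.b.hp == 0:
            return self.a.name
        if self.a.hp == 0:
            return self.b.name
        return None


def check_duel(duel: Duel, max_turns: int = 100) -> dict[str, bool]:
    """Play with a trivial bot (always the first move) and report three checks."""
    for _ in range(max_turns):
        for attacker, defender in ((duel.a, duel.b), (duel.b, duel.a)):
            if duel.winner() is None:
                duel.use_move(attacker, defender, next(iter(attacker.moves)))
        if duel.winner() is not None:
            break
    return {
        "moves fire": len(duel.acted) > 0,
        "both fight": duel.acted == {duel.a.name, duel.b.name},
        "ends cleanly": duel.winner() is not None,
    }
```

The turn cap matters: a duel whose moves never reduce HP would otherwise loop forever, and under this sketch it fails "ends cleanly" instead.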

Coding & reasoning

Coding against hidden unit tests + reasoning against held-out keys. No LLM judge.

text/v2 · 4 runs
  1. grok-4.3
    100/100 n=1
    WEB · MANUAL text/v2
    coding (strict) 50/50
    reasoning 50/50
  2. claude-opus-4.8 (Code · ULTRA)
    100/100 n=1
    WEB · MANUAL text/v2
    coding (strict) 50/50
    reasoning 50/50
  3. glm-5.2
    100/100 n=1
    WEB · MANUAL text/v2
    coding (strict) 50/50
    reasoning 50/50
  4. claude-opus-4.8 (web)
    100/100 n=1
    WEB · MANUAL text/v2
    coding (strict) 50/50
    reasoning 50/50
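
A rough Python sketch of judge-free scoring for this axis, under stated assumptions: "strict" is taken here to mean a coding task counts only if every hidden case passes, reasoning answers are compared to keys after whitespace and case normalization, and each half is scaled to 50 points. The function names and the normalization rule are hypothetical; the page does not specify them.

```python
from typing import Callable


def passes_all(func: Callable, cases: list[tuple[tuple, object]]) -> bool:
    """True only if func returns the expected value for every (args, expected) case."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed case
    return True


def score_coding(solutions: dict[str, Callable],
                 hidden: dict[str, list[tuple[tuple, object]]],
                 max_points: int = 50) -> int:
    """Assumed 'strict' rule: a task is solved only if all its hidden cases pass."""
    solved = sum(1 for task, cases in hidden.items()
                 if task in solutions and passes_all(solutions[task], cases))
    return round(max_points * solved / len(hidden))


def normalize(answer: str) -> str:
    """Assumed normalization: trim, lowercase, collapse runs of whitespace."""
    return " ".join(answer.strip().lower().split())


def score_reasoning(answers: dict[str, str], keys: dict[str, str],
                    max_points: int = 50) -> int:
    """Exact match against the held-out key after normalization."""
    correct = sum(1 for qid, key in keys.items()
                  if normalize(answers.get(qid, "")) == normalize(key))
    return round(max_points * correct / len(keys))
```

Both scorers are deterministic functions of the submission and the hidden data, which is what lets the ranking avoid a model acting as judge.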

Illustration (SVG)

Hand-writes an SVG illustration — we test it's valid SVG with real shape detail and color.

art/v1 · 4 runs
  1. grok-4.3
    100/100 n=1
    WEB · MANUAL art/v1
    loads 15/15
    valid SVG 15/15
    shape detail 40/40
    color variety 30/30
  2. claude-opus-4.8 (Code · ULTRA)
    100/100 n=1
    WEB · MANUAL art/v1
    loads 15/15
    valid SVG 15/15
    shape detail 40/40
    color variety 30/30
  3. glm-5.2
    100/100 n=1
    WEB · MANUAL art/v1
    loads 15/15
    valid SVG 15/15
    shape detail 40/40
    color variety 30/30
  4. claude-opus-4.8 (web)
    100/100 n=1
    WEB · MANUAL art/v1
    loads 15/15
    valid SVG 15/15
    shape detail 40/40
    color variety 30/30
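
One way the three SVG checks could be measured without a judge is sketched below in Python. This is an assumption about the approach, not the arena's code: the `inspect_svg` function, the list of shape tags, and the choice to count distinct `fill`/`stroke` attribute values are illustrative, and the thresholds that turn these counts into points are not stated on the page.

```python
import xml.etree.ElementTree as ET

# SVG basic shapes plus <path>; which tags the real check counts is an assumption.
SHAPE_TAGS = {"path", "rect", "circle", "ellipse", "line", "polygon", "polyline"}


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{http://www.w3.org/2000/svg}rect' -> 'rect'."""
    return tag.rsplit("}", 1)[-1]


def inspect_svg(text: str) -> dict:
    """Report validity, the number of shape elements, and distinct colors used."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return {"valid": False, "shapes": 0, "colors": 0}
    if local_name(root.tag) != "svg":
        return {"valid": False, "shapes": 0, "colors": 0}

    shapes = 0
    colors: set[str] = set()
    for element in root.iter():
        if local_name(element.tag) in SHAPE_TAGS:
            shapes += 1
        for attribute in ("fill", "stroke"):
            value = element.get(attribute)
            if value and value.lower() != "none":
                colors.add(value.lower())
    return {"valid": True, "shapes": shapes, "colors": len(colors)}
```

This sketch only reads presentation attributes; colors set through `style` attributes or CSS would need extra parsing.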