LOCAL MODEL ARENA
Live · 16 runs · 2026-06-21

Which AI actually makes
and reasons?

Local and frontier models, put through the same objective tests — games they wrote are playable right here, code is run against hidden tests, reasoning checked against held-out keys. No LLM judges.

Four frontier-class surfaces (three model families, with Claude run two ways) take the same objective tests. The widest gap is below; one axis (coding & reasoning) is ceiling-tied across every model.

100 · best · glm-5.2
70 · lowest · grok-4.3
−30 · art/v1
100 · top score
4 models
16 runs
4 axes
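The headline gap can be re-derived from the per-axis scores listed on this page. The snippet below is a minimal sketch of that arithmetic, not the arena's own code: the `SCORES` table is copied from the roster, while the function names `axis_gap` and `widest_gap` are hypothetical.

```python
# Per-axis scores as shown in the roster on this page.
SCORES = {
    "game/v2": {
        "claude-opus-4.8 (web)": 100,
        "claude-opus-4.8 (Code · ULTRA)": 75,
        "glm-5.2": 85,
        "grok-4.3": 85,
    },
    "battle/v2": {
        "claude-opus-4.8 (web)": 82,
        "claude-opus-4.8 (Code · ULTRA)": 82,
        "glm-5.2": 100,
        "grok-4.3": 72,
    },
    "text/v2": {
        "claude-opus-4.8 (web)": 100,
        "claude-opus-4.8 (Code · ULTRA)": 100,
        "glm-5.2": 100,
        "grok-4.3": 100,
    },
    "art/v1": {
        "claude-opus-4.8 (web)": 87,
        "claude-opus-4.8 (Code · ULTRA)": 95,
        "glm-5.2": 100,
        "grok-4.3": 70,
    },
}


def axis_gap(scores: dict[str, int]) -> int:
    """Spread between the best and worst score on one axis."""
    return max(scores.values()) - min(scores.values())


def widest_gap(table: dict[str, dict[str, int]]) -> tuple[str, str, str, int]:
    """Return (axis, best model, lowest model, gap) for the widest axis."""
    axis = max(table, key=lambda name: axis_gap(table[name]))
    scores = table[axis]
    best = max(scores, key=scores.get)
    lowest = min(scores, key=scores.get)
    return axis, best, lowest, axis_gap(scores)


print(widest_gap(SCORES))
```

With the scores above, the widest axis is art/v1 (glm-5.2 at 100, grok-4.3 at 70), followed by battle/v2 at 28 points and game/v2 at 25.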

Model roster — every axis at a glance

claude-opus-4.8 (web)
WEB · MANUAL
Game-making 100 · Monster battle 82 · Coding & reasoning =100 · Illustration (SVG) 87
leads on Game-making
claude-opus-4.8 (Code · ULTRA)
WEB · MANUAL
Game-making 75 · Monster battle 82 · Coding & reasoning =100 · Illustration (SVG) 95
no sole axis lead
glm-5.2
WEB · MANUAL
Game-making 85 · Monster battle 100 · Coding & reasoning =100 · Illustration (SVG) 100
leads on Monster battle and Illustration (SVG)
grok-4.3
WEB · MANUAL
Game-making 85 · Monster battle 72 · Coding & reasoning =100 · Illustration (SVG) 70
no sole axis lead
How scores work
Every run is content-addressed and append-only. Game/battle code runs in an offline headless browser (network disabled) and is played by a generic bot; coding runs against hidden unit tests; reasoning is matched to held-out keys. Scores are correctness only — speed is recorded but never ranked (local and cloud run on different hardware). No model ever judges another.
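As an illustration of what "content-addressed and append-only" can mean in practice, here is a minimal in-memory sketch. It assumes SHA-256 over canonical JSON as the address; the page does not say which hash or storage the arena uses, and `content_address`, `RunLog`, and the record fields are hypothetical names.

```python
import hashlib
import json


def content_address(record: dict) -> str:
    """SHA-256 hex digest of the record's canonical JSON (assumed scheme)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunLog:
    """Append-only log: records can be added and read, never edited."""

    def __init__(self) -> None:
        self._order: list[str] = []
        self._entries: dict[str, dict] = {}

    def append(self, record: dict) -> str:
        key = content_address(record)
        if key not in self._entries:
            # Store a deep copy so later changes to `record` cannot alter the log.
            self._entries[key] = json.loads(json.dumps(record))
            self._order.append(key)
        return key

    def get(self, key: str) -> dict:
        # Hand back a copy so callers cannot mutate stored entries.
        return json.loads(json.dumps(self._entries[key]))

    def __len__(self) -> int:
        return len(self._order)


log = RunLog()
run = {"model": "glm-5.2", "suite": "battle/v2", "score": 100}
key = log.append(run)
```

Because the key is derived from the content, appending the same record twice is a no-op, and any change to a record produces a different key rather than overwriting the old entry.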

Game-making

Writes a playable browser game — we test it loads, renders, holds ≥50 fps, is winnable & losable.

game/v2 · 4 runs
  1. claude-opus-4.8 (web)
    100/100 · σ 11.79 · n=3
    WEB · MANUAL · game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 15/15
    losable 10/10
  2. grok-4.3
    85/100 · σ 11.22 · n=5
    WEB · MANUAL · game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 10/10
  3. glm-5.2
    85/100 · σ 11.14 · n=5
    WEB · MANUAL · game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 10/10
  4. claude-opus-4.8 (Code · ULTRA)
    75/100 · σ 17.15 · n=5
    WEB · MANUAL · game/v2
    loads 15/15
    starts 10/10
    JS contract 10/10
    renders 10/10
    ≥50 fps 10/10
    winnable 20/20
    input works 0/15
    losable 0/10
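The per-check breakdowns above add up to each headline score. The sketch below reproduces that sum for game/v2, assuming each check is all-or-nothing (every value shown is either zero or full marks). The weights are read off this page; `GAME_V2_WEIGHTS` and `rubric_score` are hypothetical names, not the arena's API.

```python
# Point weights for game/v2, as displayed in the breakdowns (total 100).
GAME_V2_WEIGHTS = {
    "loads": 15,
    "starts": 10,
    "JS contract": 10,
    "renders": 10,
    "≥50 fps": 10,
    "winnable": 20,
    "input works": 15,
    "losable": 10,
}


def rubric_score(passed: dict[str, bool], weights: dict[str, int]) -> int:
    """Sum the weights of every check that passed; missing checks count as failed."""
    return sum(points for check, points in weights.items() if passed.get(check, False))


# claude-opus-4.8 (Code · ULTRA): everything passes except input and losability.
ultra = {check: True for check in GAME_V2_WEIGHTS}
ultra["input works"] = False
ultra["losable"] = False

# grok-4.3 and glm-5.2: only the input check fails.
input_fail = {check: True for check in GAME_V2_WEIGHTS}
input_fail["input works"] = False

print(rubric_score(ultra, GAME_V2_WEIGHTS), rubric_score(input_fail, GAME_V2_WEIGHTS))
```

Other suites would need partial credit: battle/v2 and art/v1 show values such as 8/16 and 22/30, so a binary pass map would not reproduce them.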

Monster battle

Pokémon-style turn duel — we test the move API works, both sides act, and it resolves to a winner.

battle/v2 · 4 runs
  1. glm-5.2
    100/100 · σ 0.0 · n=3
    WEB · MANUAL · battle/v2
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 10/10
    visual density 16/16
  2. claude-opus-4.8 (Code · ULTRA)
    82/100 · σ 0.0 · n=3
    WEB · MANUAL · battle/v2
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 0/10
    visual density 8/16
  3. claude-opus-4.8 (web)
    82/100 · σ 0.0 · n=3
    WEB · MANUAL · battle/v2
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 6/6
    power variety 8/8
    utility move 0/10
    visual density 8/16
  4. grok-4.3
    72/100 · σ 0.0 · n=3
    WEB · MANUAL · battle/v2
    loads 5/5
    starts 5/5
    JS contract 10/10
    renders 5/5
    moves fire 5/5
    both fight 5/5
    ends cleanly 5/5
    winnable 12/12
    move count 8/8
    move stats 0/6
    power variety 0/8
    utility move 0/10
    visual density 12/16

Coding & reasoning

Coding against hidden unit tests + reasoning against held-out keys. No LLM judge.

text/v2 · 4 runs · 4-way tie · all aced

Ceiling axis: all four models passed every case (6 hard coding tasks + 6 trap reasoning problems). Real separation lives on game-making, monster battle, and illustration.

  =1. grok-4.3
    100/100 · n=1
    WEB · MANUAL · text/v2
    coding (strict) 50/50
    reasoning 50/50
  =1. claude-opus-4.8 (Code · ULTRA)
    100/100 · n=1
    WEB · MANUAL · text/v2
    coding (strict) 50/50
    reasoning 50/50
  =1. glm-5.2
    100/100 · n=1
    WEB · MANUAL · text/v2
    coding (strict) 50/50
    reasoning 50/50
  =1. claude-opus-4.8 (web)
    100/100 · n=1
    WEB · MANUAL · text/v2
    coding (strict) 50/50
    reasoning 50/50
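One way the two 50-point halves could be computed is sketched below. This rests on several assumptions the page does not confirm: that "strict" means a coding task counts only if every hidden test for it passes, that reasoning answers are compared to keys by normalized exact match, and that points are spread evenly across the 6 items in each half. All function names are hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(answer.strip().lower().split())


def coding_points(test_results: list[list[bool]], max_points: int = 50) -> float:
    """Strict scoring (assumed): a task earns credit only if all its tests pass."""
    solved = sum(1 for results in test_results if results and all(results))
    return max_points * solved / len(test_results)


def reasoning_points(answers: list[str], keys: list[str], max_points: int = 50) -> float:
    """Exact match against held-out keys after normalization (assumed)."""
    correct = sum(normalize(a) == normalize(k) for a, k in zip(answers, keys))
    return max_points * correct / len(keys)


# Hypothetical run: six coding tasks with all tests passing, six correct answers.
all_pass = [[True, True, True] for _ in range(6)]
keys = ["42", "yes", "7", "b", "no", "3"]
answers = [" 42 ", "Yes", "7", "B", "no", "3"]

total = coding_points(all_pass) + reasoning_points(answers, keys)
print(total)
```

Under these assumptions a single failing hidden test drops a whole coding task, which is one reason a strict scheme tends to cluster scores at the ceiling or well below it.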

Illustration (SVG)

Hand-writes an SVG illustration — we test it's valid SVG with real shape detail and color.

art/v1 · 4 runs
  1. glm-5.2
    100/100 · n=1
    WEB · MANUAL · art/v1
    loads 10/10
    valid SVG 10/10
    shape detail 30/30
    color variety 22/22
    technique 16/16
    shape variety 12/12
  2. claude-opus-4.8 (Code · ULTRA)
    95/100 · n=1
    WEB · MANUAL · art/v1
    loads 10/10
    valid SVG 10/10
    shape detail 30/30
    color variety 22/22
    technique 16/16
    shape variety 7/12
  3. claude-opus-4.8 (web)
    87/100 · n=1
    WEB · MANUAL · art/v1
    loads 10/10
    valid SVG 10/10
    shape detail 22/30
    color variety 22/22
    technique 16/16
    shape variety 7/12
  4. grok-4.3
    70/100 · n=1
    WEB · MANUAL · art/v1
    loads 10/10
    valid SVG 10/10
    shape detail 14/30
    color variety 15/22
    technique 9/16
    shape variety 12/12