Local Model Arena — AI model benchmark

Game-making

Writes a playable browser game: we test that it loads, renders, holds ≥50 fps, and is both winnable and losable.
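As an illustration of the "holds ≥50 fps" criterion, here is a minimal sketch of how a frame-rate check could work from a list of frame timestamps. The function name `holds_fps`, the use of the median frame interval, and the sample data are assumptions made for this example; the page does not describe how the arena actually measures frame rate.

```python
from statistics import median

def holds_fps(frame_times_ms, min_fps=50.0):
    """Return True if the median frame rate over the sample is at least min_fps.

    frame_times_ms: timestamps in milliseconds of consecutive rendered frames
    (hypothetical input, e.g. collected by a test harness from the page).
    """
    if len(frame_times_ms) < 2:
        return False  # not enough frames to measure an interval
    # Time elapsed between each pair of consecutive frames.
    deltas = [b - a for a, b in zip(frame_times_ms, frame_times_ms[1:])]
    if any(d <= 0 for d in deltas):
        return False  # timestamps must be strictly increasing
    fps = 1000.0 / median(deltas)
    return fps >= min_fps

# Synthetic samples: one frame every ~16.67 ms (60 fps) and every ~33.33 ms (30 fps).
smooth = [i * (1000.0 / 60) for i in range(120)]
choppy = [i * (1000.0 / 30) for i in range(120)]

print(holds_fps(smooth))  # True
print(holds_fps(choppy))  # False
```

The median is used here so that a single slow frame does not fail an otherwise smooth game; a stricter harness might use the worst interval instead.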

game/v2 · 4 runs

1

100/100 σ 11.79 · n=3

claude-opus-4.8 (web)

WEB · MANUAL game/v2

loads 15/15
starts 10/10
JS contract 10/10
renders 10/10
≥50 fps 10/10
winnable 20/20
input works 15/15
losable 10/10

2

85/100 σ 11.22 · n=5

grok-4.3

WEB · MANUAL game/v2

loads 15/15
starts 10/10
JS contract 10/10
renders 10/10
≥50 fps 10/10
winnable 20/20
input works 0/15
losable 10/10

3

85/100 σ 11.14 · n=5

glm-5.2

WEB · MANUAL game/v2

loads 15/15
starts 10/10
JS contract 10/10
renders 10/10
≥50 fps 10/10
winnable 20/20
input works 0/15
losable 10/10

4

75/100 σ 17.15 · n=5

claude-opus-4.8 (Code · ULTRA)

WEB · MANUAL game/v2

loads 15/15
starts 10/10
JS contract 10/10
renders 10/10
≥50 fps 10/10
winnable 20/20
input works 0/15
losable 0/10
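
Each card's headline score is the sum of its rubric lines, and the "σ · n" figure appears to be a spread over n runs. The sketch below reproduces the 85/100 total from the grok-4.3 game/v2 card and shows how a σ could be derived. The per-run totals are hypothetical (the page does not list them), and the use of the sample standard deviation is an assumption.

```python
from statistics import stdev

# Rubric lines from the grok-4.3 game/v2 card: name -> (earned, max).
card = {
    "loads": (15, 15),
    "starts": (10, 10),
    "JS contract": (10, 10),
    "renders": (10, 10),
    "≥50 fps": (10, 10),
    "winnable": (20, 20),
    "input works": (0, 15),
    "losable": (10, 10),
}

def card_total(card):
    """Sum earned and maximum points over all rubric lines."""
    earned = sum(e for e, _ in card.values())
    maximum = sum(m for _, m in card.values())
    return earned, maximum

earned, maximum = card_total(card)
print(f"{earned}/{maximum}")  # 85/100

# Hypothetical per-run totals, NOT taken from the page; they only show
# how a "σ · n" pair relates to a set of run scores.
run_totals = [100, 85, 70]
sigma = stdev(run_totals)  # sample standard deviation (assumed convention)
print(f"σ {sigma:.2f} · n={len(run_totals)}")  # σ 15.00 · n=3
```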

Monster battle

Pokémon-style turn-based duel: we test that the move API works, both sides act, and the battle resolves to a winner.

battle/v2 · 4 runs

1

100/100 σ 0.0 · n=3

glm-5.2

WEB · MANUAL battle/v2

loads 5/5
starts 5/5
JS contract 10/10
renders 5/5
moves fire 5/5
both fight 5/5
ends cleanly 5/5
winnable 12/12
move count 8/8
move stats 6/6
power variety 8/8
utility move 10/10
visual density 16/16

2

82/100 σ 0.0 · n=3

claude-opus-4.8 (Code · ULTRA)

WEB · MANUAL battle/v2

loads 5/5
starts 5/5
JS contract 10/10
renders 5/5
moves fire 5/5
both fight 5/5
ends cleanly 5/5
winnable 12/12
move count 8/8
move stats 6/6
power variety 8/8
utility move 0/10
visual density 8/16

3

82/100 σ 0.0 · n=3

claude-opus-4.8 (web)

WEB · MANUAL battle/v2

loads 5/5
starts 5/5
JS contract 10/10
renders 5/5
moves fire 5/5
both fight 5/5
ends cleanly 5/5
winnable 12/12
move count 8/8
move stats 6/6
power variety 8/8
utility move 0/10
visual density 8/16

4

72/100 σ 0.0 · n=3

grok-4.3

WEB · MANUAL battle/v2

loads 5/5
starts 5/5
JS contract 10/10
renders 5/5
moves fire 5/5
both fight 5/5
ends cleanly 5/5
winnable 12/12
move count 8/8
move stats 0/6
power variety 0/8
utility move 0/10
visual density 12/16
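
To make the "moves fire", "both fight" and "ends cleanly" checks concrete, here is a minimal sketch of a turn duel plus the kind of assertions a harness could run against it. The `Move` and `Monster` classes, the `run_duel` function and the monster names are all hypothetical; the arena's real move API and test harness are not shown on this page.

```python
import random
from dataclasses import dataclass

@dataclass
class Move:
    name: str
    power: int  # damage dealt when the move fires

@dataclass
class Monster:
    name: str
    hp: int
    moves: list

def run_duel(a, b, rng, max_turns=200):
    """Alternate turns until one monster faints.

    Returns (winner_name, names_that_acted); winner_name is None if the
    duel did not resolve within max_turns.
    """
    acted = set()
    attacker, defender = a, b
    for _ in range(max_turns):
        move = rng.choice(attacker.moves)               # the move "fires"
        defender.hp = max(0, defender.hp - move.power)  # apply its damage
        acted.add(attacker.name)
        if defender.hp == 0:
            return attacker.name, acted
        attacker, defender = defender, attacker         # swap sides
    return None, acted

fox = Monster("Emberfox", 40, [Move("Scratch", 10), Move("Flame", 20)])
pup = Monster("Tidepup", 40, [Move("Splash", 8), Move("Wave", 15)])

winner, acted = run_duel(fox, pup, random.Random(0))
print(winner is not None)                 # the duel resolves to a winner
print(acted == {"Emberfox", "Tidepup"})   # both sides acted
```

With these stats neither monster can win in a single hit, so both sides always get at least one turn, and every move deals positive damage, so the duel always ends.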

Coding & reasoning

Coding against hidden unit tests + reasoning against held-out keys. No LLM judge.

text/v2 · 4 runs · 4-way tie, all aced

Ceiling axis — all four models passed every case (6 hard coding tasks + 6 trap reasoning problems). Real separation lives on game-making and monster battle.

=

100/100 n=1

grok-4.3

WEB · MANUAL text/v2

coding (strict) 50/50
reasoning 50/50

=

100/100 n=1

claude-opus-4.8 (Code · ULTRA)

WEB · MANUAL text/v2

coding (strict) 50/50
reasoning 50/50

=

100/100 n=1

glm-5.2

WEB · MANUAL text/v2

coding (strict) 50/50
reasoning 50/50

=

100/100 n=1

claude-opus-4.8 (web)

WEB · MANUAL text/v2

coding (strict) 50/50
reasoning 50/50
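
Grading without an LLM judge means every answer is checked mechanically. The sketch below shows one way that could look: a candidate function is run against hidden test cases, a reasoning answer is compared with a held-out key, and points are split evenly across tasks. All function names, the leap-year task and the even split are assumptions for illustration; only the 50-point halves and the six tasks per half come from the page.

```python
def passes_hidden_tests(candidate, hidden_cases):
    """Strict grading: the task counts only if every hidden case passes."""
    for args, expected in hidden_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True

def matches_key(answer, key):
    """Compare a reasoning answer with its held-out key, ignoring case and padding."""
    return answer.strip().lower() == key.strip().lower()

def axis_points(passed, max_points=50):
    """Split max_points evenly across tasks (assumed weighting)."""
    return max_points * sum(passed) / len(passed)

# Hypothetical coding task: decide whether a year is a leap year.
def good_is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def buggy_is_leap(year):
    return year % 4 == 0  # forgets the century rule

hidden_cases = [((2024,), True), ((1900,), False), ((2000,), True), ((2023,), False)]

print(passes_hidden_tests(good_is_leap, hidden_cases))   # True
print(passes_hidden_tests(buggy_is_leap, hidden_cases))  # False
print(matches_key("  42 ", "42"))                        # True
print(axis_points([True] * 6))                           # 50.0
```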

Illustration (SVG)

Hand-writes an SVG illustration: we test that it is valid SVG with real shape detail and color.

art/v1 · 4 runs

1

100/100 n=1

glm-5.2

WEB · MANUAL art/v1

loads 10/10
valid SVG 10/10
shape detail 30/30
color variety 22/22
technique 16/16
shape variety 12/12

2

95/100 n=1

claude-opus-4.8 (Code · ULTRA)

WEB · MANUAL art/v1

loads 10/10
valid SVG 10/10
shape detail 30/30
color variety 22/22
technique 16/16
shape variety 7/12

3

87/100 n=1

claude-opus-4.8 (web)

WEB · MANUAL art/v1

loads 10/10
valid SVG 10/10
shape detail 22/30
color variety 22/22
technique 16/16
shape variety 7/12

4

70/100 n=1

grok-4.3

WEB · MANUAL art/v1

loads 10/10
valid SVG 10/10
shape detail 14/30
color variety 15/22
technique 9/16
shape variety 12/12
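
The "valid SVG", "shape detail", "shape variety" and "color variety" lines suggest checks that can be computed from the markup alone. Here is a minimal sketch of that kind of inspection using Python's standard XML parser. The function `inspect_svg`, the list of shape tags and the way colors are counted are assumptions for this example; the arena's actual thresholds and scoring rules are not given on this page.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
SHAPE_TAGS = {"rect", "circle", "ellipse", "line", "polyline", "polygon", "path"}

def inspect_svg(text):
    """Parse SVG text and return simple counts a rubric could be built on."""
    empty = {"valid": False, "shapes": 0, "shape_kinds": 0, "colors": 0}
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return empty  # not well-formed XML
    if root.tag != SVG_NS + "svg":
        return empty  # root element is not <svg> in the SVG namespace
    shapes = []
    colors = set()
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip any non-element nodes
        local = el.tag.split("}")[-1]  # drop the namespace prefix
        if local in SHAPE_TAGS:
            shapes.append(local)
        for attr in ("fill", "stroke"):
            value = el.get(attr)
            if value and value.lower() != "none":
                colors.add(value.lower())
    return {
        "valid": True,
        "shapes": len(shapes),            # rough proxy for shape detail
        "shape_kinds": len(set(shapes)),  # rough proxy for shape variety
        "colors": len(colors),            # rough proxy for color variety
    }

sample = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <rect width="100" height="100" fill="#87CEEB"/>
  <circle cx="80" cy="20" r="10" fill="gold"/>
  <path d="M0 80 L50 40 L100 80 Z" fill="#2e8b57" stroke="#14532d"/>
</svg>"""

print(inspect_svg(sample))
# {'valid': True, 'shapes': 3, 'shape_kinds': 3, 'colors': 4}
```

This sketch only reads `fill` and `stroke` attributes; colors set through `style` attributes, CSS or gradients would need extra handling.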