Game-making
Writes a playable browser game. We test that it loads, renders, holds ≥50 fps, and is both winnable and losable.
#1 · 100/100 · σ 11.79 · n=3
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 15/15 · losable 10/10
#2 · 85/100 · σ 11.22 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 10/10
#3 · 85/100 · σ 11.14 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 10/10
#4 · 75/100 · σ 17.15 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 0/10
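The ≥50 fps check can be pictured as a threshold on measured frame timing. The following is a minimal sketch in Python, assuming a harness records one timestamp in milliseconds per rendered frame. The function names `average_fps` and `holds_fps`, the averaging rule, and the sample data are all hypothetical; the page does not describe how the real measurement is taken.

```python
def average_fps(frame_times_ms):
    """Average frames per second over a list of frame timestamps (ms)."""
    if len(frame_times_ms) < 2:
        raise ValueError("need at least two frame timestamps")
    elapsed_ms = frame_times_ms[-1] - frame_times_ms[0]
    if elapsed_ms <= 0:
        raise ValueError("timestamps must be increasing")
    frames = len(frame_times_ms) - 1  # number of intervals between frames
    return frames * 1000.0 / elapsed_ms


def holds_fps(frame_times_ms, threshold=50.0):
    """True if the average frame rate meets the (assumed) threshold."""
    return average_fps(frame_times_ms) >= threshold


# Hypothetical samples: one frame every 1/60 s, and one frame every 25 ms.
smooth = [i * (1000.0 / 60.0) for i in range(121)]
choppy = [i * 25.0 for i in range(81)]

print(holds_fps(smooth))  # True
print(holds_fps(choppy))  # False
```

An average hides short stalls, so a stricter variant might instead require every individual frame interval to stay under 20 ms.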
Monster battle
Pokémon-style turn duel. We test that the move API works, both sides act, and the battle resolves to a winner.
#1 · 100/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 10/10 · visual density 16/16
#2 · 82/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 0/10 · visual density 8/16
#3 · 82/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 0/10 · visual density 8/16
#4 · 72/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 0/6 · power variety 0/8 · utility move 0/10 · visual density 12/16
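Each total in this section equals the sum of its per-check points. The arithmetic can be sketched in Python using the values from the 72/100 row. The helper names `total` and `failed_checks` are illustrative, and whether the real harness aggregates exactly this way (for example, how it combines the n runs) is not stated on the page.

```python
# Per-check (earned, maximum) points, copied from the 72/100 monster-battle row.
rubric = {
    "loads": (5, 5),
    "starts": (5, 5),
    "JS contract": (10, 10),
    "renders": (5, 5),
    "moves fire": (5, 5),
    "both fight": (5, 5),
    "ends cleanly": (5, 5),
    "winnable": (12, 12),
    "move count": (8, 8),
    "move stats": (0, 6),
    "power variety": (0, 8),
    "utility move": (0, 10),
    "visual density": (12, 16),
}


def total(rubric):
    """Return (points earned, points available) across all checks."""
    earned = sum(e for e, _ in rubric.values())
    maximum = sum(m for _, m in rubric.values())
    return earned, maximum


def failed_checks(rubric):
    """Names of checks that lost at least one point."""
    return [name for name, (e, m) in rubric.items() if e < m]


print(total(rubric))          # (72, 100)
print(failed_checks(rubric))  # ['move stats', 'power variety', 'utility move', 'visual density']
```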
Coding & reasoning
Coding against hidden unit tests + reasoning against held-out keys. No LLM judge.
Ceiling axis: all four models passed every case (6 hard coding tasks + 6 trap reasoning problems), so this category does not separate them. Real separation lives on game-making and monster battle.
= · 100/100 · n=1
coding (strict) 50/50 · reasoning 50/50
= · 100/100 · n=1
coding (strict) 50/50 · reasoning 50/50
= · 100/100 · n=1
coding (strict) 50/50 · reasoning 50/50
= · 100/100 · n=1
coding (strict) 50/50 · reasoning 50/50
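Grading against hidden unit tests without an LLM judge can be sketched as plain equality checks. The Python below is a hedged illustration only: it assumes "strict" means a task counts only when every hidden case passes, and that each axis is worth 50 points split evenly across its tasks. The names `passes_strict` and `axis_score`, the example task, and its cases are hypothetical.

```python
def passes_strict(candidate, hidden_cases):
    """True only if candidate returns the expected value for every case.

    hidden_cases is a list of (args, expected) pairs; any exception
    counts as a failure (assumed meaning of "strict").
    """
    for args, expected in hidden_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def axis_score(task_results, axis_max=50):
    """Scale the fraction of passed tasks to the axis maximum (assumed rule)."""
    return axis_max * sum(task_results) / len(task_results)


# Hypothetical task: reverse the words in a sentence.
def reverse_words(text):
    return " ".join(reversed(text.split()))


cases = [(("a b c",), "c b a"), (("hello",), "hello"), (("",), "")]

print(passes_strict(reverse_words, cases))  # True
print(axis_score([True] * 6))               # 50.0
```

Under these assumptions, six passed tasks out of six give the 50/50 shown in every row.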
Illustration (SVG)
Hand-writes an SVG illustration. We test that it is valid SVG with real shape detail and color.
#1 · 100/100 · n=1
loads 10/10 · valid SVG 10/10 · shape detail 30/30 · color variety 22/22 · technique 16/16 · shape variety 12/12
#2 · 95/100 · n=1
loads 10/10 · valid SVG 10/10 · shape detail 30/30 · color variety 22/22 · technique 16/16 · shape variety 7/12
#3 · 87/100 · n=1
loads 10/10 · valid SVG 10/10 · shape detail 22/30 · color variety 22/22 · technique 16/16 · shape variety 7/12
#4 · 70/100 · n=1
loads 10/10 · valid SVG 10/10 · shape detail 14/30 · color variety 15/22 · technique 9/16 · shape variety 12/12
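Checks such as "valid SVG", "shape variety", and "color variety" could be approximated by parsing the file and counting elements. The sketch below uses Python's standard `xml.etree.ElementTree`. It treats "valid" as "well-formed XML with an `svg` root", which is weaker than full SVG validation, and the function `inspect_svg`, the shape list, and the sample drawing are assumptions rather than the benchmark's actual scoring code.

```python
import xml.etree.ElementTree as ET

# Basic SVG shape elements counted by this sketch (an assumed list).
SHAPE_TAGS = {"rect", "circle", "ellipse", "line", "polyline", "polygon", "path"}


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/2000/svg}'."""
    return tag.split("}", 1)[-1]


def inspect_svg(svg_text):
    """Return rough counts of shapes, shape kinds, and distinct colors."""
    empty = {"valid": False, "shapes": 0, "shape_kinds": 0, "colors": 0}
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return empty  # not well-formed XML
    if local_name(root.tag) != "svg":
        return empty  # root element is not <svg>

    shapes = [local_name(el.tag) for el in root.iter()
              if local_name(el.tag) in SHAPE_TAGS]
    colors = set()
    for el in root.iter():
        for attr in ("fill", "stroke"):
            value = el.get(attr)
            if value and value.lower() != "none":
                colors.add(value.lower())
    return {"valid": True, "shapes": len(shapes),
            "shape_kinds": len(set(shapes)), "colors": len(colors)}


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<rect width="100" height="100" fill="#87CEEB"/>'
    '<circle cx="50" cy="50" r="20" fill="#FFD700" stroke="none"/>'
    '<path d="M0 80 L100 80" stroke="#228B22"/>'
    '</svg>'
)

print(inspect_svg(sample))
# {'valid': True, 'shapes': 3, 'shape_kinds': 3, 'colors': 3}
```

This sketch only reads `fill` and `stroke` attributes, so colors set through `style` attributes, CSS, or gradients would be missed.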