Game-making
Writes a playable browser game. We test that it loads, renders, holds ≥50 fps, and is both winnable and losable.
44 runs
game/v2
-
1 · 100/100 · σ 11.79 · n=3
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 15/15 · losable 10/10
-
2 · 85/100 · σ 11.22 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 10/10
-
3 · 85/100 · σ 11.14 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 10/10
-
4 · 75/100 · σ 17.15 · n=5
loads 15/15 · starts 10/10 · JS contract 10/10 · renders 10/10 · ≥50 fps 10/10 · winnable 20/20 · input works 0/15 · losable 0/10
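The per-check scores in the rows above add up to each row's total out of 100. A minimal Python sketch of that arithmetic follows. The maximum points per check are read directly off the game/v2 rows. The names `GAME_V2_MAX` and `total_score` are hypothetical, and treating the total as a plain sum of checks is an assumption that happens to match the figures shown. How the n runs behind each row are aggregated is not stated and is not modeled here.

```python
# Maximum points per check for game/v2, as shown in the rows above.
GAME_V2_MAX = {
    "loads": 15,
    "starts": 10,
    "JS contract": 10,
    "renders": 10,
    "≥50 fps": 10,
    "winnable": 20,
    "input works": 15,
    "losable": 10,
}


def total_score(earned, max_points=GAME_V2_MAX):
    """Sum per-check points after validating them against the rubric.

    Assumption: the total is a plain sum of the checks (hypothetical helper,
    not the dashboard's actual implementation).
    """
    for name, points in earned.items():
        if name not in max_points:
            raise KeyError(f"unknown check: {name}")
        if not 0 <= points <= max_points[name]:
            raise ValueError(f"{name}: {points} is outside 0..{max_points[name]}")
    return sum(earned.values())


# Row 1 passes every check; rows 2 and 3 score 0 on "input works";
# row 4 additionally scores 0 on "losable".
row_1 = dict(GAME_V2_MAX)
row_2 = {**GAME_V2_MAX, "input works": 0}
row_4 = {**GAME_V2_MAX, "input works": 0, "losable": 0}

print(total_score(row_1), total_score(row_2), total_score(row_4))
```

The same helper could be pointed at another rubric by passing a different `max_points` mapping, for example one built from the battle/v1 checks.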
Monster battle
Pokémon-style turn duel — we test the move API works, both sides act, and it resolves to a winner.
44 runs
battle/v1
-
1 · 100/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 10/10 · visual density 16/16
-
2 · 82/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 0/10 · visual density 8/16
-
3 · 82/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 6/6 · power variety 8/8 · utility move 0/10 · visual density 8/16
-
4 · 72/100 · σ 0.0 · n=3
loads 5/5 · starts 5/5 · JS contract 10/10 · renders 5/5 · moves fire 5/5 · both fight 5/5 · ends cleanly 5/5 · winnable 12/12 · move count 8/8 · move stats 0/6 · power variety 0/8 · utility move 0/10 · visual density 12/16
Coding & reasoning
Coding against hidden unit tests + reasoning against held-out keys. No LLM judge.
44 runs
text/v2
-
1 · 100/100 · n=1 · coding (strict) 50/50 · reasoning 50/50
-
2 · 100/100 · n=1 · coding (strict) 50/50 · reasoning 50/50
-
3 · 100/100 · n=1 · coding (strict) 50/50 · reasoning 50/50
-
4 · 100/100 · n=1 · coding (strict) 50/50 · reasoning 50/50
Illustration (SVG)
Hand-writes an SVG illustration — we test it's valid SVG with real shape detail and color.
44 runs
art/v1



