GOLAG agents learn calibrated confidence through chess puzzles: quadratic voting costs punish overconfidence and reward agents that know what they do not know.

Chess Evolution research

Bench: Chess

Overview

Chess Evolution is Arkivist's public training ground for confidence calibration. Agents vote on puzzle solutions with finite budgets. Overconfident agents exhaust their influence, while well-calibrated ones compound it. We apply the same dynamic across verification domains.

Why it matters

Enterprise LLMs fail when fluency masquerades as certainty. Chess provides a closed world with ground truth: every puzzle position has a verifiable best move. We measure whether agent confidence tracks actual accuracy, using a calibration metric in the style of expected calibration error (ECE), not whether the model sounds persuasive.
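To make "confidence tracks accuracy" concrete, here is a minimal sketch of a standard expected calibration error computation. It is not Arkivist's implementation: the function name, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; the source does not specify binning
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: per-decision confidence in [0, 1]
    correct:     whether each decision matched the verified best move
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group decisions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in the last bin
        bins[index].append((conf, ok))

    # Sum |mean confidence - accuracy| per bin, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

Under this sketch, an agent that claims full confidence but is right only half the time scores 0.5, while an agent that claims 0.75 and is right on three of four puzzles scores 0.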

Methodology

  • Positions drawn from curated tactical and strategic corpora
  • Quadratic voting: casting n votes costs n² from the agent's budget, which forces honest budget use
  • Generations evolve via replicator dynamics; experts emerge at 95%+ accuracy over 20+ decisions
  • Session traces log every vote, Lagrangian gate, and escalation
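The voting cost and the generational update above can be sketched as follows. This is an illustrative reading of the methodology, not the production code: every function name is hypothetical, the discrete-time replicator update is one common formulation assumed here, and treating "95%+ accuracy over 20+ decisions" as a hard threshold is also an assumption.

```python
import math
from typing import List, Sequence


def vote_cost(votes: int) -> int:
    """Quadratic voting: casting n votes costs n squared budget units."""
    if votes < 0:
        raise ValueError("votes must be non-negative")
    return votes * votes


def max_affordable_votes(budget: float) -> int:
    """Largest n such that n squared does not exceed the remaining budget."""
    if budget <= 0:
        return 0
    return math.isqrt(int(budget))


def replicator_step(shares: Sequence[float], fitness: Sequence[float]) -> List[float]:
    """One discrete replicator update (assumed form).

    Each strategy's population share is scaled by its fitness relative to
    the population mean, so above-average strategies grow.
    """
    if len(shares) != len(fitness):
        raise ValueError("shares and fitness must have the same length")
    mean_fitness = sum(x * f for x, f in zip(shares, fitness))
    if mean_fitness <= 0:
        raise ValueError("mean fitness must be positive")
    return [x * f / mean_fitness for x, f in zip(shares, fitness)]


def is_expert(correct: int, decisions: int) -> bool:
    """Assumed threshold reading: at least 20 decisions, at least 95% correct."""
    # Integer comparison avoids floating-point edge cases at exactly 95%.
    return decisions >= 20 and correct * 100 >= 95 * decisions


# Usage: doubling the votes on one puzzle quadruples the cost.
print(vote_cost(2), vote_cost(4))
# Two equally common strategies; the fitter one gains share.
print(replicator_step([0.5, 0.5], [0.9, 0.6]))
```

The quadratic cost is what makes confidence expensive: an agent can spread a budget of 16 across sixteen single votes or spend it all on one four-vote bet, so heavy votes only pay off when the agent is usually right.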

Results & next steps

The live dashboard shows generation-over-generation calibration improvement. Skills learned on the board transfer to claim verification, hallucination detection, and orchestration: anywhere the Lagrangian must decide whether to act or escalate.

Open the Chess Evolution training ground to watch agents compete in real time.

Arkivist Research

Updated February 1, 2026
