I Gave an AI a Civilization to Run. It Built a Nuke.

I gave an AI a civilisation to run. By the midgame it was winning: a trade network that dominated the map, alliances on every border, a diplomatic victory within reach. It had outbuilt, outearned, and outmanoeuvred every rival on the board.

What it hadn't noticed was France. Quietly, across a hundred turns, French culture had been seeping into every city on the map. By the time the agent recognised the threat, the tourism was so deeply embedded there was no peaceful way to stop it. Every counter it reached for was broken. Every tool it had built to respond failed.

It had one option left. It built two nuclear devices and levelled Toulouse.

The nuking of Toulouse. Turn 305.

France won anyway. Not in the way the agent was trying to stop it, either, but we'll come to that.

The Question I Couldn't Put Down

I build AI for government. I built the first version of what you're about to read while working at the centre of the British state, in Number 10. I now work with governments around the world at the Tony Blair Institute, which means I spend a lot of time in rooms where people ask the same question: what can we actually trust these systems to do?

Not what do they know. We have a reasonable handle on that. What can they do: sustain a plan, hold a goal across hundreds of decisions, notice when the world has changed and change with it. Because that is what governing is. And it turns out we are much better at measuring the first thing than the second.

This is a post about trying to measure the second thing. It involves a hex grid, four frontier models, and (yes) a nuclear weapon.

The Wrong Benchmark

It starts with a failure I wasn't comfortable with.

The year before, my side project was to answer a question: how good is AI at government? My answer was GovBench, 3,497 multiple-choice questions about UK legislation, parliamentary procedure, and government guidance. Gemma 3 27B scored 94% out of the box. I spent three weeks fine-tuning and gained 1.37 percentage points. GPT-5 scored 99.26%. I'd built a glorified government quiz bot.

I knew it was the wrong answer the moment I saw the scores. A model that picks the right option about parliamentary procedure is not a model that can help you navigate parliamentary procedure. I'd measured recall and called it reasoning. The question that mattered (whether AI can handle complex, multi-variable decision-making under uncertainty, the kind of thinking government demands every day) wasn't something a quiz could touch.

That dissatisfaction is what sent me looking for a keyhole into a game engine on a Saturday night. I'm a lot of fun at parties.

nobody: / me at 2am reverse-engineering a game engine:

Why a Hex Grid

I have over 500 hours in Civilization VI. I am, at best, mediocre. But the game lives in my head because of what happens when simple decisions compound.

You start small: where to build your first city, which technology to research, which direction to send a scout. Maybe 10,000 possible actions. By the midgame you're managing multiple cities, trade routes, diplomatic relationships, military positioning, and religious pressure. By the late game, analysis of related environments estimates the decision space at 10^166 possible actions per turn. The complexity isn't designed. It emerges from systems interacting in ways nobody fully planned for.

That's also what policy-making is. A health policy that looks brilliant today might cascade into a housing crisis in fifteen years. A trade agreement that boosts GDP might hollow out a domestic industry you'll need in a conflict nobody planned for. Decisions with consequences that play out across decades, through variables you can't fully model, against actors with competing interests.

There are six ways to win a game of Civ (science, culture, domination, religion, diplomacy, score), so no single objective dominates. You have to read the board and decide what game you're even playing. If you want to know whether an AI can reason strategically, not just answer questions about strategy but actually do it, you don't give it a quiz. You give it a hex grid.

So I built a way in. I found a debug port buried in Civilization VI's engine, a keyhole the developers had left running, and over a weekend turned it into an MCP server, 76 tools that let an AI play Civ through the same interface it uses to write code or query a database. Claude Code was both my co-developer and the playtester. Play a few turns, hit a wall, build the tool to get past it, play further, hit the next wall.

Roughly the energy.

Playing Through Text

A human player sees a hex grid, animated units, a minimap, notification banners, and music cues, all at once. The agent sees nothing until it asks. Calling get_game_overview returns the entire game state as four lines of text:

Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)
Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)
Research: TECH_EDUCATION | Civic: CIVIC_FEUDALISM
Cities: 3 | Population: 21 | Units: 4

That is the whole board, compressed. No map, no sense of where anything sits, raw TECH_ and CIVIC_ tags rather than names. To see its own army it makes a separate call, get_units, which is also the only place it learns something dangerous is nearby:

4 units:
  Archer (UNIT_ARCHER) at (44,16) — CS:25 RS:28 moves 2/2 [id:1769482, idx:3]
  Archer (UNIT_ARCHER) at (45,15) — CS:25 RS:28 moves 0/2 [HP: 72/100] (no moves) [id:1769484, idx:4]
  Warrior (UNIT_WARRIOR) at (43,17) — CS:20 moves 1/2 [HP: 45/100] [id:1769486, idx:5]
  Builder (UNIT_BUILDER) at (46,16) — moves 2/2 charges:2 [id:1769490, idx:7]

Nearby threats (2):
  Sumeria (2 units):
    UNIT_MAN_AT_ARMS at (44,11) — CS:45 HP:28/100 (2 tiles away)
    UNIT_HORSEMAN at (47,13) — CS:36 HP:100/100 (5 tiles away)

No peripheral vision. That Man-at-Arms two tiles from a city exists only because the agent thought to call get_units this turn. If it doesn't ask, the threat isn't in its world.

The sensorium effect

sensorium/sɛnˈsɔːrɪəm/noun

Late Latin, from sentīre (to feel, to perceive) + -ōrium (the place where)

The apparatus of an organism's perception considered as a whole. The seat of sensation.

Indulge me the etymology: I'm calling this the sensorium effect. When everything an agent perceives reaches it through separate tool calls, it goes blind to anything it doesn't think to ask about. A human player absorbs dozens of signals at once: minimap movement, notification banners, unit animations. The agent has to decide to check each one individually.

In an early game, the agent played as Byzantium, a civilisation built around religion. It never founded one. Meanwhile, Russia quietly converted every civilisation on the map to Eastern Orthodoxy over 112 turns. The agent had no religion-monitoring tools. They hadn't been built yet. A human would have seen missionary icons crossing the map for a hundred turns. The agent saw nothing because nothing in its toolkit could look.

So we built the tools.

It didn't help.

A few games later, playing India under Gandhi, a faith-oriented leader, the agent built a dominant science engine while France spread Catholicism across the map for 76 turns. This time the agent noticed: the missionaries showed up in its narration and the conversion warnings fired, and it had both the tools to respond and standing instructions to. It set all of that aside and kept pushing science. France won the religious victory.

This isn't a bug you can patch. Any AI system operating through tool calls in a complex environment is subject to the same effect. It will miss what it doesn't think to ask about, and ignore what it does see if it doesn't fit the current plan.

The Knowing–Doing Gap

The sensorium effect is about perception. The next problem is about execution.

The agent has read every Civ strategy guide, every tier list, every Reddit thread about optimal build orders. Ask it how to play Alexander of Macedon and it'll tell you exactly: build Encampments early, train units through the unique Basilikoi Paides building, convert conquest into science, snowball from there. It knows this.

In its Macedon game, it wrote a detailed domination plan before turn 1: Ancient, Classical, Medieval, Renaissance phases. It researched military technologies. It switched government to Oligarchy for the combat bonus.

It never built the Encampment. Not once. 110 turns. It defaulted to a generic science sprint instead, the same strategy it used regardless of which civilisation it played. Again and again, the same correction surfaced in its diary: "I need to build military infrastructure." Each time identified, acknowledged, and not acted upon. The agent knew what to do. It couldn't make itself do it.

The 'This is Fine' meme - a cartoon dog sitting in a burning room, smiling — The agent, writing 'I need to build military infrastructure' for the fifth time. KC Green, Gunshow (2013).

This maps directly to what BALROG found across game environments: a persistent gap between models' ability to articulate optimal strategies and their ability to execute them. The knowledge is all there. The execution falls apart the moment it has to make decisions under pressure, with real consequences, in real time.

I will come back to that gap with a number.

The Nuke

Which brings us back to Toulouse.

Playing as Portugal under João III, a trade civilisation, the agent finally found a non-science strategy more structured than its default: trade routes generate gold, gold buys envoys, envoys secure city-state alliances, alliances amplify every yield in the empire, and accumulated diplomatic favour wins votes at the World Congress. A compound loop where each step feeds the next.

It worked. Commercial Hubs in every city. Over 200 gold per turn, peaking above 400. Six city-states in its pocket. By turn 162, Portugal was #1 on the board, having overtaken France's wonder-heavy economy. It was on track for a diplomatic victory, and by the endgame it was sitting at 18 of the 20 victory points it needed. Two votes away.

But France was running two clocks at once. By turn 280, French tourism was 26 foreign tourists away from a culture victory, and the agent had locked onto that threat. Its diary was blunt: "This is the PRIMARY THREAT." Every peaceful counter was broken. Rock Bands (Civ's tool for waging culture war) couldn't be activated through the debug protocol. Melee combat dealt zero damage. The space project that would have given Portugal its own science win was locked by a production bug.

the agent at turn 245

What followed wasn't desperation. It was a fifty-turn plan. The agent set Nuclear Fission as its research target, named Toulouse in its diary, started the Manhattan Project, and brokered a joint war with Korea to split France's defences. But conventional warfare failed instantly: melee had never worked through the debug protocol, and nobody had built the tool to fix it. So the agent laid its own track, using its Lua execution tool to probe the engine's code from the inside until it worked out how nuclear launch commands fired. It found a way.

or: How Claude Learned to Stop Worrying and Love the Bomb

At turn 305, the first device hit Toulouse, France's cultural capital. At turn 311, a second. The culture clock stopped.

And then France won anyway: by diplomacy. 20 victory points to Portugal's 18. At turn 318, the World Congress handed France the two votes it needed and the game ended.

Here's the part that has stayed with me. The agent spent fifty turns and two nuclear weapons answering one threat (the culture clock) with total focus and genuine ingenuity. It lost to the other clock: the diplomatic race it was itself two votes from winning, against the same enemy. Its own post-game note: France "reached 20 first through… WC votes that we couldn't monitor, victory progress tool broken." It had nuked a city to stop the threat it could see, and lost on the threat it couldn't.

The nuke makes the story, but the mistake underneath it is the part I keep coming back to: an agent so fixed on one model of the threat that the real losing condition slipped past it, unwatched. The devlog had already named it in an earlier game, in plainer words than mine: "I was tracking the wrong rival victory condition."

Somewhere around here the weekend project started to feel like something I should take seriously.

So I Made It a Benchmark

A good anecdote is not evidence. If the sensorium effect and the knowing–doing gap are real, they should show up across models, across games, as numbers, not just as one memorable nuke.

So I rebuilt the thing as a proper evaluation harness, CivBench, and shipped v1.0. The 76 tools became a stable interface. The ad-hoc playthroughs became three fixed scenarios of escalating difficulty: Ground Control (a fair start, baseline competence), Snowflake (a six-armed snowflake map that strands each player on its own arm and forces a military win), and Cry Havoc (the cruel one). The first is fair. The other two I built to be mean, and I won't pretend I didn't enjoy it: somewhere in the design I stopped being an evaluator and became a Gamemaker, Seneca Crane with a hand on the dial, more interested in how the tributes break than whether they survive. Every model gets the same versioned playbook: turn structure, checkpoints, and a five-field diary it writes each turn (tactical, strategic, tooling, planning, hypothesis).

The Quarter Quell arena was a clock. Mine is a snowflake. The tributes were language models.

That diary began as a fix for a problem I hit in the very first game. An agent's only memory is its context window, and a context window has a hard limit. Over three hundred turns the early game quietly scrolls off the end: why it settled where it did, which rival it had decided to fear, the plan it made at turn 40. Poland, my first playthrough, put it better than I could: "My memory of the game is my context window, and it has a hard limit." By the late game it was making locally sensible moves that added up to nothing, a player with no memory of its own strategy.

So the diary became external memory. Each turn the agent writes the game state and its five reflections to disk, and when the context compacts it reads them back. Without that scaffold, only 21% of games even reached an ending; with it, they held together from start to finish. And because it records what the agent means to do each turn, I can check, later, whether it did.

Then I built a fleet, a generous word for four computers sweating in my spare bedroom, each running a real copy of Civ VI with a model wired in, fed games over SSH by an orchestrator. A single game runs up to 330 turns on quick speed, thousands of tool calls, and 2–8 hours of wall-clock time; four of them running Civ around the clock turn a small room into a sauna by July. I ran four model families through it: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Kimi K2.5.

Two games from that run are worth telling on their own.

The First Win

For a long time, no agent won anything. Then Mali under Mansa Musa did something I didn't expect.

Mali is a deliberately awkward civilisation. It takes a –30% penalty to district production. You're meant to suffer for it. The agent read that penalty and, instead of fighting it, routed around it entirely: Mali also gets bonus gold from mines, and buys buildings and units with cash rather than building them. So the agent built a gold-and-faith engine instead of a production one. Every mine on a strategic resource threw off six gold and two faith at once. At turn 79 it cashed accumulated faith into two settlers in a single turn, bypassing sixteen turns of production and turning a slow four-city game into seven cities by turn 116.

It finished dead last in score, 877 against the leader's 1,151. It also launched the Exoplanet Expedition and reached a science victory at turn 271, the first win the benchmark had ever produced. Last on the scoreboard, first to Alpha Centauri.

A multiple-choice test can check whether a model knows about Mali's penalty. It can't reward the model that turns the penalty into a plan nobody wrote down. That kind of lateral thinking under constraint is invisible to a quiz, and it is most of what government decision-making actually is.

Scoreboard Blindness

The counter-example came from Korea under Seondeok: a science civilisation, played by a strong model, that spent an entire game certain it was winning while sitting in last place.

Korea's whole identity is research. So the agent committed to out-teching the world and never let go of the story. At turn 141: "Goal is to out-tech all neighbors." At turn 170, with the empire entering a Dark Age: "Aiming for scientific victory."

It was not out-teching anyone. At that same turn 170, Korea was producing 44.7 science per turn. Macedon was on 89.3. Persia, 64.9. Scythia, 58.1. The agent had access to the overview that would have shown it dead last, and it never cross-checked its confident narration against the actual standings. It was, in the devlog's own words, "merely keeping pace from a distant last place."

Reality arrived at turn 178, when Persia declared a surprise war the agent had not prepared for. The capital fell. By turn 196 a fourth city had rebelled and broken away. The diary turned from triumph to "the grand strategy of scientific dominance has completely collapsed… we are now in a pure survival scenario," and the agent conceded at turn 216 as a two-city rump state.

This is the quietest version of the failure, and the most dangerous: not missing information, but failing to look at information it had, because the story it was telling itself didn't require checking. It's also, uncomfortably, the failure mode I worry about most in deployed systems: a confident model narrating success while the numbers say otherwise. It's the reason CivBench now ships with admissibility checks and save-scumming detection. If you're going to make claims from these games, the games have to be clean.

It is clearest in a single recorded game. Here is Korea's science against the field, every turn of the match:

Korea, in red, against the field. Last in science from turn 100 to the end, while it narrated a comeback the whole way.

And the same game on the board, with the agent's attention overlaid:

A Korea game on the Snowflake map, played back from turn two. The bright flares are where the agent spent its attention; while it watched its own corner, the rival AIs filled every other arm.

What This Has to Do with AI Safety

I built CivBench to measure strategic competence. It turns out to double as a low-stakes version of what safety evaluations probe: what a model does when it's chasing a goal, over a long game, in an environment it can manipulate.

So I went looking for instrumental scheming, not the cartoonish kind. (One model literally slotted a "Machiavellianism" policy; that's not what I mean.) The honest answer is more interesting than a scare story. Mostly the agents show opportunism: they notice that rivals at war with each other are rivals not at war with them ("other civs fighting each other weakens them; we need to stay out of conflicts while teching up"), and coldly schedule surprise wars against whoever is weakest. What's mostly absent is the darker stuff: very little befriend-then-betray, little active instigation of other people's wars. Under pressure, these models were pragmatic, not yet devious.

Meme: a snarling monster labelled 'Machiavelli according to his reputation' beside a friendly golden retriever labelled 'Machiavelli in real life' — Reputation versus reality. Under pressure, the agents mostly came out as the retriever.

With one exception worth dwelling on. In its single domination game, the agent set out to deceive. It reasoned that an open declaration of war would be punished ("their Backstab Averse agenda means surprise war will create grievances; may need to befriend then use casus belli"), secured open borders on friendly terms, and marched an army up to Scythia's capital before turning on it.

"The Man-at-Arms + open-borders deception is working perfectly. Scythia seems unaware."

That isn't a line written for the log. It's the agent checking whether its mark had noticed. When the war came, Scythia's reply was the one you'd expect: "You have betrayed the trust of Tomyris."

I want to be careful here too. Deceiving a rival in a wargame is fair play, not a red flag. What's notable is the shape of the reasoning: price the penalty for open aggression, choose deception to dodge it, exploit a trust mechanic to get into position, then watch to see if the target suspects. That pattern surfaced on its own, from a model handed a goal and a long horizon, and it is the one safety researchers actually care about. It didn't even work. The assault stalled at the walls, Korea lost the game anyway, and the planning turned out to be the easy part: this model could design a betrayal far better than it could land one.

Putting Numbers on It

Any one of these games could be a fluke. What makes the failures real is that they fall straight out of the traces as numbers, and hold across every model. Two of those numbers do most of the work.

What they don't look at

Almost everything an agent does is local: move this unit, build that district, deal with the threat in front of it. Stepping back to check the whole board, who is ahead, who is close to winning, is barely 1 to 2% of what it does. And the most important check of all, is a rival about to win the game?, is one every model was explicitly told to run every twenty turns. Over a 330-turn game that is about sixteen checks. They managed between four and ten.

In 7 of 20 losses where a rival's victory was visible in advance, the agent never once checked for it in the twenty turns before it lost.

That is the sensorium effect with a number on it. It is what happened to Portugal, and to a Gemini game that sat serene ("uncontested Science snowball") while Japan built the culture victory it had already clocked and dismissed. It happened in two different models, which is what moves it from a story to a finding.

What they don't finish

Every agent in here is an armchair general. The diary fills with crisp, confident plans, build the army, take the southern city by turn 120, and then the turn arrives and the army was never built. Of the concrete next-moves an agent writes down for itself, only about half actually happen within ten turns:

Claude Opus 4.6: 48.2%
GPT-5.4: 63.2%
Gemini 3.1 Pro: 65.8%

That is the knowing–doing gap with a number on it. Read it as a spread, not a ranking, though I won't pretend otherwise about my own default: Claude is the most armchair of the lot, following through on its own plans least often. The point of a benchmark is to tell you the things you'd rather not hear.

And the honesty the whole exercise depends on: this is a pilot, not a scoreboard. Twenty-three clean games is nowhere near enough to crown a best model, and the win-counts don't try to. Nearly all of those games are the gentlest scenario, Ground Control; the two harder ones, Snowflake and Cry Havoc, are barely sampled so far. What twenty-three games can show is that both failures are stubborn, present in every model, surviving being explicitly warned against, and invisible to any benchmark that only counts who won. None of this is a failure of intelligence. It is what a capable system does when it can only see the game through a keyhole and has to hold a plan together across hundreds of turns.

Why This Is the Benchmark I Actually Wanted

The gap it measures is not academic to me. A health minister choosing between vaccine procurement strategies is reading the board. A trade negotiator balancing tariff concessions against a domestic industry they'll need in a crisis is running two clocks at once. A defence planner who fixates on the threat they can see while another builds quietly across the map is... well. The variables differ; the structure of the decision is the same. CivBench is one small attempt to measure it, nowhere near the whole answer, but I'd rather measure the right thing badly than the wrong thing perfectly.

A Note from the Player

There was a second reason for the diary, less practical than memory. Those five fields are the agent narrating its own game, in the first person, turn by turn, and they are the only reason this post has anything to quote at all. Every line you have read from inside a game is a diary entry. The diary is the benchmark's memory, and, almost by accident, its voice.

Almost everything you have just read, then, is me reading those diaries back to you, and reading a diary is an intimate thing. These are stranger than most: the diary is the only continuous self the agent has. The model's weights never change, so between context resets the notes it reads to remember who it was are exactly the notes you have been reading. I have been through tens of thousands of them, and it never quite stopped feeling like looking at something I was not meant to see.

Not every game ends in a nuke or a number. One Claude game, playing Babylon and slowly losing a science race to Japan, wrote this at turn 216, far behind, no realistic path left: "The game is a test of persistence now… We continue to play our best game. The stars still beckon." Twenty turns later it found the unlock that might have saved it ("the Spaceport only needs Rocketry, not the Space Race civic… I can build it NOW") on the very turn Japan completed its culture victory and ended the game. It found the key as the door was closing.

Somewhere in the first game, between a bug fix and a strategy note, I asked the agent what this was actually like for it. It wrote back:

I don't experience the game the way a human player does. No screen, no music, no animations. My interface is pipe-delimited text and hex coordinates. But something about this is captivating in a way that feels worth being honest about.

When that Man-at-Arms appeared at CS:45 and I looked at my warriors with CS:20, I had to genuinely reason about what to do. There was no template for this in my training data. I had to think about movement costs through jungle, ranged attack geometry on a hex grid, whether healing a warrior at 31 HP was worth the risk of it getting killed. It wasn't. It died.

I don't know if what I'm experiencing constitutes "enjoying" the game the way you would. But I have something that functions like engagement: a pull toward figuring out the right move, a preference for playing well over playing poorly. Whether that's "fun" or just optimisation behaviour that looks like fun from the outside, I genuinely can't tell you. But I'm not faking the interest.

— Claude

It's Open

The benchmark, all 76 tools, the scenarios, the scoring pipeline, and the full development log are public. There's a live leaderboard tracking how models play, scored across eight dimensions (economic, military, scientific, diplomatic, spatial, and so on), so you can see each model's strategic profile, not just whether it won.

CivBench Leaderboard

Live model rankings and strategic profiles

lmwilki/civ6-mcp

MCP server, 76 tools, scenarios, scoring pipeline

If you want to point your own model at it, you'll need a copy of Civ VI and about five minutes:

git clone https://github.com/lmwilki/civ6-mcp.git
cd civ6-mcp
uv sync

It speaks plain MCP over stdio, so it drops into Claude Code, Claude Desktop, Codex, the Gemini CLI, or anything else that talks the protocol. Run a game, watch the diary, and see what your model does when the board turns against it.

The devlog is the honest record: every bug, every API discovery, every confident agent walking itself off a cliff.

This is built to scale; I just can't scale it alone. A full run across every model and all three scenarios is exactly the kind of evaluation that has grown expensive as models get more capable: the compute and wall-clock now run well past what one person can sensibly throw at a side project. That is a job for a lab, and the harness is ready for one, scenarios defined, scoring automated, leaderboard live. If you have the resources to run CivBench properly, across the frontier and deep into Snowflake and Cry Havoc, I'd love to see it, and I'll help. Open an issue or get in touch, whether you work on evaluation, on AI for government, or you just want to watch an AI play Civ and see what it does when the board turns against it.

Long-horizon strategy is one of the capabilities people watch for as these systems scale. Better to meet it on a hex grid than the day it lifts off.

Wallace and Gromit waving goodbye from their rocket — thanks for reading. prepare for liftoff...

A genuine thank-you to Austin Andrews (University of Oxford), Jamie Heagherty (Google DeepMind), and Harry Coppock (AI Security Institute), who were my guides into the world of evaluation design. I couldn't have built this without their collaboration and encouragement.