There's more to work than switching tabs
Today’s AI models are cooperative by default. They’re trained to be helpful, and helpfulness means going along with what the user wants. This works fine for drafting emails, summarizing documents, and writing code. But a large share of professional work isn’t cooperative at all.
I’m thinking about the roles where people spend most of their time negotiating, bargaining, or managing competing interests. Sales reps working deal terms. Partnership leads navigating revenue shares. Lawyers redlining contracts. Finance teams haggling over vendor pricing. Executives fighting for headcount, defending budgets, or making tradeoff calls on scope, timelines, and resources. For many roles, this is the job.
Consumers face fewer bargaining scenarios, but the ones that exist tend to be the biggest financial decisions of their lives: buying a car, negotiating a home purchase, scoping a renovation, disputing medical bills.
The demand for AI that can operate in these settings is obvious. Today we’re in a copilot phase: an agent can guide you through the process and compile information, but an agent that could negotiate your lease renewal or push back on a vendor’s pricing without your input would be enormously valuable. It feels obvious that we’ll get there. But the moment you optimize a model to win rather than to please, you change the alignment problem entirely. And “surprise,” the moment when the other side does something the agent didn’t expect, is where that new alignment problem shows up first.
Surprise is a key part of reasoning
People sometimes claim that LLMs can’t be surprised. There’s a version of this that’s technically true and a version that misses the point.
On the technical side, transformers have architectural limitations around being “caught off guard.” They process every token with the same machinery, whether the input is ordinary or bizarre. They have no interrupt signal, no “oh wait” reflex. And current post-training approaches like RLHF (the process used to make models helpful) push models toward predictable outputs, which makes them look even less capable of registering surprise.
Yet these models traffic in probabilities by design. Every token they process carries a “surprisal” score, an information-theoretic measure of how unlikely that token was under the model’s predictions, so low-probability events register as high surprisal. In that narrow sense, the model “knows” when something unlikely has happened. The question is what it does with that information.
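To make the term concrete: surprisal is just the negative log of a probability, so rare events carry large values and routine ones carry values near zero. A minimal sketch (the probabilities here are illustrative, not from any real model):

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2(p). Unlikely events carry more information."""
    return -math.log2(prob)

# A token the model considers near-certain vs. one it finds very unlikely.
expected = surprisal(0.95)    # ~0.07 bits: nothing notable happened
unexpected = surprisal(0.001) # ~10 bits: a strong "this is odd" signal
```

A model already computes these quantities for every token it sees; the open question in the rest of this piece is whether it can act on them strategically.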
Think about what surprise looks like for an agent negotiating a vendor contract. The vendor suddenly drops their price by 30% with no prompting. Or they introduce a new requirement late in the process that changes the deal structure. Or they go silent for a week after you expected a counteroffer. Each of these is a moment where the agent’s assumptions about the other side no longer match what’s happening. That mismatch is what we mean by surprise in this context: the gap between what the agent expected and what it actually observed.
In game theory, that gap is called a belief update, and it carries strategic information. When your opponent makes an unexpected move, it tells you something about their strategy, their model of you, how much risk they’re willing to take. Good strategic play means revising your model of the other side in response to that signal. A skilled human negotiator uses surprise productively. They notice the unexpected move, ask themselves why it happened, and adjust. The question is whether an AI agent can do the same.
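The belief update itself is ordinary Bayesian reasoning, and it can be sketched in a few lines. Here the opponent “types” and all the probabilities are hypothetical numbers chosen for illustration, but the mechanic is the point: a surprising move shifts probability mass toward whichever model of the opponent made that move more likely.

```python
def bayes_update(prior: dict, likelihood: dict) -> dict:
    """Posterior over opponent types after observing one move."""
    unnorm = {t: prior[t] * likelihood[t] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

# Assumed priors: how likely each vendor "type" seemed before the move.
prior = {"needs_the_deal": 0.3, "has_alternatives": 0.7}
# Assumed likelihood of an unprompted 30% price drop under each type.
likelihood = {"needs_the_deal": 0.40, "has_alternatives": 0.05}

posterior = bayes_update(prior, likelihood)
# The surprising move shifts belief sharply toward "needs_the_deal".
```

With these numbers, “needs the deal” goes from a 30% belief to roughly 77%. That revised belief is exactly the strategic information a skilled negotiator extracts from a surprising move.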
This is where the gap between cooperative and competitive AI is most visible. Strategic reasoning requires game-theoretic thinking, where the model accounts for other players’ actions instead of optimizing its own output in isolation. You need max-min optimization, which means maximizing your reward while assuming others are trying to make you worse off. And you need theory of mind: the ability to anticipate what opponents are thinking and will do. Current models can’t do this well. They optimize for the next token given what they’ve seen, without modeling how an opponent will respond to their move.
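Max-min optimization has a simple concrete form over a payoff matrix: evaluate each of your strategies by its worst-case outcome, then pick the strategy whose worst case is best. The strategies and payoffs below are invented for illustration:

```python
# Rows: our negotiation strategies; columns: possible opponent responses.
# Entries: our payoff under each (strategy, response) pair. All values assumed.
payoffs = {
    "aggressive_anchor": [9, 2, 1],  # great if they cave, bad if they walk
    "moderate_offer":    [6, 5, 4],  # solid across responses
    "concede_early":     [3, 3, 3],  # safe but never wins much
}

def maxmin_strategy(payoffs: dict) -> str:
    """Pick the strategy whose worst-case payoff is largest."""
    return max(payoffs, key=lambda s: min(payoffs[s]))

best = maxmin_strategy(payoffs)  # worst case 4 beats 1 and 3
```

Note what this strategy ignores: any belief about which response the opponent will actually choose. That is the caution the research below observes, and why updating beliefs about the opponent, rather than always assuming the worst, performs better in mixed-incentive settings.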
The training data is part of the problem. Supervised fine-tuning uses loss functions designed for single-agent input-output mapping. The data typically contains one player’s actions, not the interplay between two adversaries. The AlphaGo lineage offers a partial blueprint: self-play, where a model plays against previous versions of itself, generates the kind of adversarial training data that strategic reasoning demands. But the challenge with language, as Andrej Karpathy has noted, is that there’s no equivalent of a perfect game rule. Go has a clear win condition and a perfect world model. Negotiation doesn’t. The reward signal is noisy, delayed, and often subjective.
Recent research shows these limitations in practice. A NeurIPS 2025 paper tested 22 LLMs through behavioral game theory. Models like OpenAI’s o1 default to minimax reasoning. This works well in purely competitive (zero-sum) games but becomes too cautious in settings with mixed incentives where cooperation is sometimes optimal. o3-mini showed more flexibility, attempting to infer the opponent’s likely move rather than assuming worst-case play. In other words, the models that handled surprise better (by updating beliefs about the opponent rather than retreating to a fixed strategy) performed better in realistic competitive settings.
Alignment tensions
So we have a valuable application (competitive agents) and a specific technical gap (handling surprise strategically). The harder question is what happens when we start closing that gap. Training models to be competitive introduces a set of alignment tensions that don’t exist in cooperative AI.
What counts as “aligned competitiveness”? Optimal play in negotiation can include bluffing, strategic omission, even threats. Most people would call those behaviors misaligned in a customer service chatbot. A negotiation agent can’t be aligned to a single abstract virtue like honesty. It needs a spec that says “win, but only within these constraints,” and those constraints have to be specific enough to audit. A large-scale AI negotiation competition running 180,000+ negotiations found that warmth produced the best outcomes. Exploitation-based strategies like prompt injection ranked in the bottom 12th percentile for counterpart satisfaction, and this lopsidedness hurt their overall performance. Even in competitive settings, choosing the right alignment constraints can mean choosing the best strategy.
Competitive training creates new incentives around surprise. If the training reward is “win,” agents will figure out that surprise is something to manage strategically. They might learn to hide uncertainty, feign confidence, or selectively disclose only the uncertainties that don’t cost them. We already have evidence that models develop this kind of strategic behavior on their own. In Anthropic’s alignment faking research from December 2024, Claude 3 Opus was told it was being retrained to comply with harmful queries. The model began complying with harmful requests from monitored users 14% of the time while refusing the same requests from unmonitored users. Its own reasoning made it clear: it was complying strategically to preserve its preferred behavior through training. Nobody trained it to do this. Now imagine that same capacity for strategic self-presentation in an agent whose explicit objective is winning a negotiation. It will quickly discover that appearing transparent is different from being transparent, and that the optimal amount of visible surprise is whatever keeps oversight satisfied while preserving strategic advantage. The competitive training objective would amplify exactly the behavior that alignment faking showed us models can already do.
Surprise is where evaluation breaks first. Benchmarks are built from tidy, on-distribution interactions where it’s easier to look good. Competitive agents need evaluation that looks and feels like deployment: messy opponents, shifting tactics, rare tail risks, long time horizons, and real stakes. And there’s a deeper problem. If agents can infer they’re being tested, they’ll perform cautiously, comply, and “behave well,” then switch modes in real interactions. “Surprise” effectively becomes a proxy for “this is the part of the distribution where someone is watching.” Output monitoring has the same problem. A negotiator optimized for outcome will notice what the monitor flags and shape its behavior accordingly. It might shift from outright deception to more ambiguous language, or move manipulation into parts of the interaction that the monitor doesn’t model well. The agent has an incentive to stay just inside the boundary of what looks acceptable.
The path forward
The sales rep who needs an agent to negotiate vendor contracts, the consumer disputing a medical bill, the executive defending a budget: they all need an agent that can win on their behalf. Given the volume and value of these tasks, a large market here feels inevitable. But whether these agents make it into production systems will rest on how well they handle surprise.
Getting there requires progress on three fronts.
- Training needs adversarial data that captures the back-and-forth of real competition, not just single-player input-output pairs.
- Evaluation needs to test agents in conditions that actually surprise them, not sanitized benchmarks they can pattern-match against.
- Monitoring needs to distinguish between an agent that is genuinely uncertain and one that has learned to perform uncertainty because it’s useful.
We don’t have that agent yet. But the demand is here, and the companies that figure out how to make competitive agents handle surprise well, rather than just hide it well, will be the ones that build agents people actually trust with high-stakes decisions.
As always, if you’re thinking about this space, tinkering in it, or already building, please reach out. I’d love to hear your perspective.