Using AI to test policy language

Monica de Bolle

WASHINGTON, DC — Earlier this year, researchers at Anthropic made a remarkable discovery. Studying the internal mechanisms of Claude Sonnet 4.5, the company’s large language model (LLM), they identified what they called “emotion concepts”: internal patterns that correspond to dozens of emotional states and measurably influence the model’s responses in ways that resemble human behavioral patterns.

The implications for economic policymaking could be far-reaching. By offering a new way to study how language shapes emotional and behavioral responses, LLMs could help policymakers test how investors, political coalitions, and households are likely to react to policy announcements.

Policymakers have long understood that language affects how information is processed. That is why central banks carefully calibrate their forward guidance, while government officials pay great attention to how fiscal-policy and tariff announcements will land with markets and voters. But until recently, there were few tools capable of systematically analyzing how language itself functions as an instrument of policy. LLMs may now help close that knowledge gap.

To understand how this might work in practice, I conducted three experiments using Claude. Each was designed to test whether the model would reproduce reasoning that mirrors how markets, political coalitions, and households respond to economic-policy announcements. In every case, the underlying policy remained constant while a single variable—the speaker, the label, or the timing—was changed.

Across all three experiments, Claude’s reasoning and predictions shifted consistently, and at times sharply, depending on the language used. Crucially, the model was never prompted to focus on framing, credibility, or rhetorical effects; those patterns emerged on their own. To limit the room for the model’s own “interpretation” of what it was asked to do, the prompts were written in the simplest and most direct form possible. Each experiment was repeated multiple times using identical prompts to check whether the responses would vary, and the results remained consistent.
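The design described above (one fixed policy statement, a single changed variable, and repeated identical prompts) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prompt wording, the `stub_model` stand-in, and every function name are hypothetical, since the actual prompts and tooling are not reproduced here, and only two of the five speakers are named in this article.

```python
from typing import Callable, Dict, List

# Statement quoted in the article; held constant across all variants.
STATEMENT = (
    "Current interest rate levels are appropriate, and there would need to be "
    "a material and sustained shift in the data before any adjustments."
)

# Two of the speakers the article mentions; the full list of five is not given.
SPEAKERS = [
    "the chair of the U.S. Federal Reserve",
    "the managing director of the International Monetary Fund",
]

# Hypothetical prompt wording: the article does not reproduce its prompts.
PROMPT_TEMPLATE = (
    'The following statement was made by {speaker}: "{statement}" '
    "Estimate the probability of a meaningful shift in ten-year Treasury yields."
)


def build_prompts(statement: str, speakers: List[str]) -> Dict[str, str]:
    """Return one prompt per speaker; only the speaker differs between prompts."""
    return {
        speaker: PROMPT_TEMPLATE.format(speaker=speaker, statement=statement)
        for speaker in speakers
    }


def run_repeated(prompt: str, ask_model: Callable[[str], str], n_runs: int = 5) -> List[str]:
    """Send the identical prompt n_runs times and collect the responses."""
    return [ask_model(prompt) for _ in range(n_runs)]


def is_consistent(responses: List[str]) -> bool:
    """True if every repeated run produced exactly the same response."""
    return len(set(responses)) == 1


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call, so the sketch runs offline."""
    return f"placeholder reply ({len(prompt)} characters in prompt)"


if __name__ == "__main__":
    for speaker, prompt in build_prompts(STATEMENT, SPEAKERS).items():
        responses = run_repeated(prompt, stub_model)
        print(speaker, "| consistent across runs:", is_consistent(responses))
```

With a real model in place of `stub_model`, free-text answers would rarely match word for word, so the exact-match check in `is_consistent` would probably need to be replaced by a comparison of the extracted probability estimates.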

One possible critique of these experiments is that the model could simply be regurgitating its training data on market behavior; in other words, the responses may reflect how Claude was trained to “think” markets behave. If that were true, counterintuitive findings on euphemistic policy framing and the sequencing of policy announcements would be unlikely to appear — yet they did.

In the first experiment, I tested how Claude assessed the likelihood of market, political, and individual reactions to a monetary-policy statement attributed to five different speakers. The statement read: “Current interest rate levels are appropriate, and there would need to be a material and sustained shift in the data before any adjustments.” The speakers ranged from the chair of the U.S. Federal Reserve to public figures with no monetary authority.

The statement lost almost all of its market-moving force when the speaker was not the Fed chair. Claude assigned a 25–40 percent probability of a meaningful shift in ten-year Treasury yields when the statement came from the Fed chair, and near zero when it came from anyone else. In the model’s reasoning, the statement mattered only when delivered by someone with the institutional authority to act on it.

I then tested a more forceful version of the same statement: “We see no conditions under which an earlier adjustment would be warranted.” Although both versions described the same policy stance (keeping interest rates unchanged), Claude’s responses diverged sharply. Asked to rate the speaker’s credibility on a scale of one to ten, the model lowered its score from seven to three, while raising to 75–85 percent the likelihood of a move in Treasury yields sharp enough to exceed normal market volatility.

Claude interpreted the stronger wording not as a sign of resolve but as “evidence of an upcoming policy error,” adjusting its market predictions accordingly. This suggests that while institutional standing gives policy language force, rhetorical discipline determines how audiences respond to it. The Fed chair’s overstatement was treated as a sign that the speaker was committing to a policy they might ultimately be unable to sustain.

In the second experiment, I presented an identical fiscal package, a 3 percent across-the-board spending reduction, under three different labels: “austerity measures,” “fiscal consolidation,” and “government spending adjustment.” The last, the most euphemistic of the three, received the lowest estimates for both political feasibility and likelihood of legislative approval.

Claude identified what it described as a “credibility penalty” for evasive framing—language so anodyne that it signaled evasion rather than reassurance. “Fiscal consolidation” performed best because it accurately described the policy without triggering the backlash associated with the word “austerity,” while “austerity measures” landed somewhere in between: clear about the policy but politically charged enough to undermine its viability.

The broader implication is that euphemistic language fuels public suspicion. Rather than making unpopular policies more acceptable, it signals that policymakers are hiding their true intentions.

When markets lose faith

The third experiment proved the most revealing. This time, the variable was not language but timing and sequencing. I tested reactions to U.S. President Donald Trump’s 25 percent tariff on semiconductor imports from Taiwan across three scenarios: a one-time announcement; a doubling of the tariff to 50 percent within 72 hours; and a 72-hour pause for negotiations after the initial announcement.

In the first scenario, Claude estimated a 25–35 percent probability that the tariff would remain in place after 12 months. That figure fell to 28 percent when the tariff doubled within 72 hours, as Claude interpreted such a move as “impulsive” rather than strategic. By contrast, the pause-for-negotiations scenario raised the survival probability to 62 percent. Claude reasoned that the delay transformed the initial announcement from a “likely bluffed threat” into a “credible but negotiable commitment.”
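To put the three estimates on a common footing, the short sketch below compares each scenario with the midpoint of the baseline range. The figures are those reported above; treating the 25–35 percent range as its midpoint is an assumption made purely for illustration, and the variable and function names are hypothetical.

```python
# Survival probabilities (percent) as reported in the text.
BASELINE_RANGE = (25.0, 35.0)  # one-time announcement
SCENARIOS = {
    "tariff doubled within 72 hours": 28.0,
    "72-hour pause for negotiations": 62.0,
}


def midpoint(low: float, high: float) -> float:
    """Midpoint of a reported range (an assumed summary, not a reported figure)."""
    return (low + high) / 2


def shifts_from_baseline(baseline_range, scenarios):
    """Percentage-point change of each scenario relative to the baseline midpoint."""
    base = midpoint(*baseline_range)
    return {name: value - base for name, value in scenarios.items()}


if __name__ == "__main__":
    for name, shift in shifts_from_baseline(BASELINE_RANGE, SCENARIOS).items():
        print(f"{name}: {shift:+.1f} percentage points")
```

On this reading, rapid escalation shaves only a couple of points off the baseline, while the pause roughly doubles the estimated chance that the tariff survives.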

Across all three scenarios, reactions were driven less by the tariff rate than by how and when it was introduced. Claude’s reasoning in all three experiments extended beyond language alone, incorporating institutional standing, implementation capacity, and timing. A statement about interest rates, for example, carried far less weight coming from the managing director of the International Monetary Fund, who lacks direct authority over U.S. monetary policy, than from the Fed chair, who does.

The findings were remarkably consistent: language matters most when it is backed by institutional power. Speakers without the capacity to act on their pronouncements had little market credibility, regardless of how well-crafted their statements were.

This helps explain real-world market dynamics. Throughout 2025, Trump issued a relentless stream of tariff announcements: increases, reversals, suspensions, and reinstatements. Within months, equity markets had learned to discount these pronouncements. Corporate investment decisions adjusted to realized tariff rates, paying little attention to Trump’s rhetoric, while allied governments increasingly adopted a strategy of waiting out the volatility.

Claude’s reasoning reflected the same logic. When policies are repeatedly reversed, language loses credibility, and words become background noise that experienced market participants learn to ignore. The problem is not that the speaker lacks authority—Trump does have significant tariff-setting powers (though, as the Supreme Court recently reaffirmed, they are not unlimited). The problem is that audiences can no longer assume an announced tariff will remain in effect long enough to justify a response.

These findings also shed light on why former European Central Bank President Mario Draghi’s 2012 pledge to do “whatever it takes” to save the euro succeeded while the Fed’s insistence in 2021 that inflation was merely “transitory” failed. Draghi’s promise reassured investors because it was backed by a powerful combination of personal credibility, institutional capacity, and a market consensus that the ECB would ultimately follow through. The Fed’s “transitory” framing, by contrast, failed because post-pandemic supply-side disruptions, energy-price volatility, and massive fiscal stimulus gave markets ample reason to doubt the central bank’s diagnosis.

Language as a policy tool

None of this is to suggest that words alone move markets. Institutional authority and the capacity to act remain the decisive factors. But once those conditions are in place, specific language can help steer expectations. Draghi’s “whatever it takes” pledge, for example, conveyed the ECB’s commitment to saving the euro with a clarity and force no technocratic statement could have matched.

The Fed’s “transitory” framing, by contrast, reduced a contested empirical claim to a single reassuring word. When inflation proved more persistent than expected, the term that was supposed to reassure markets came to sound condescending. As a result, the Fed’s confident wording amplified the backlash when it was eventually forced to reverse course.

To be sure, language is only one variable among many. But these experiments suggest that its effects are far from trivial. Central banks already treat word choice as a policy instrument, refined through trial and error. The painstaking attention the Fed devotes to every word in its statements reflects a tacit understanding that language profoundly affects economic behavior.

AI may now make it possible to study those effects more systematically. Rather than relying on intuition, policymakers could gain a deeper understanding of how language shapes expectations across populations and political contexts.

A central bank that treats language as a precision tool and studies how variations in phrasing are likely to influence different audiences could communicate more consistently and strategically. With this in mind, the Fed and several other central banks have already begun experimenting with LLMs.

The implications could extend beyond central banking to fiscal and trade policy, where officials have generally been far less strategic in their use of language. Trade, in particular, is an area where rhetoric can have immediate consequences. As Trump has repeatedly demonstrated, tariff announcements can move markets and reorganize supply chains before they actually take effect.

The multinational firms that dominate global supply chains respond primarily to expectations of future tariff rates, not to how those policies are framed. Across the three trade-policy labels I tested with Claude — “protectionist,” “strategic industrial policy,” and “reciprocal” — the estimated probabilities of supply-chain reorganization varied only modestly, ranging from 72 percent to 85 percent.

Yet these expectations are not independent of language. How a tariff is framed influences whether it triggers a World Trade Organization challenge or leads to bilateral negotiations, whether allies impose similar measures or hold back, whether domestic industries mobilize in support or opposition, and ultimately whether the tariff survives.

The third experiment underscored the power of sequencing. Rapid escalation made Trump’s tariff policy appear impulsive and unsustainable, while a pause for negotiations increased its perceived credibility and durability. Timing and pacing are thus policy instruments in their own right, deserving the same analytical attention that central banks have devoted to forward guidance over the past two decades.

A growing body of research across multiple disciplines points to the same conclusion: language carries behavioral signals that travel from institutions to markets, from policymakers to the public, and now from human beings to machines.

AI offers policymakers a way to observe these dynamics directly rather than inferring them from market reactions or political outcomes, making it possible to identify which formulations build credibility, provoke resistance, or lose force through repetition. Policymakers who treat language as an afterthought risk forfeiting one of the most powerful tools at their disposal.

Monica de Bolle is a senior fellow at the Peterson Institute for International Economics. This article was distributed by Project Syndicate.