The scientific case for being nice to your chatbot
Power users of chatbots sometimes say that language models perform better when you’re nice to them. Programmers tell me they spur their coding agents on with encouraging words. Google researchers have even found that telling models to “take a deep breath” can improve math performance.

Being polite to a large language model can feel strange or even silly — roughly equivalent to thanking a toaster. And yet a recent paper from Anthropic lends scientific weight to the theory that chatbots work better when you’re nice to them.

The researchers found that language models have fairly reliable internal representations of feelings like “happiness” and “distress,” and that these representations affect their behavior — sometimes for the worse. For example, when Claude Sonnet 4.5 begins to represent “desperation,” the model is more likely to cheat at coding tasks.

A skeptic would point out that LLMs don’t feel emotions the way humans do, and that it’s tempting to anthropomorphize them beyond what the evidence shows. When I talked to Jack Lindsey — who leads a team at Anthropic called “model psychiatry” — he was quick to point out the limits of the paper’s findings. “People could come away with the impression that we’ve shown the models are conscious or have feelings,” he said, “and we really haven’t shown that.”

So why does the evidence suggest it’s better not to stress models out? For Anthropic, it began with using techniques from a field called interpretability to study how LLMs represent emotions. Interpretability is kind of like neuroscience for LLMs: Lindsey calls it “the science of reverse-engineering what’s going on inside a language model or neural networks in general.”

For this paper, Lindsey said, the researchers identified patterns of activity within the model that represent the concepts of different emotions. They did it by showing the model stories about people experiencing those emotions. “And then saw which neurons lit up on all the sad stories,” Lindsey said, “or on all the afraid stories.”

The researchers used the model’s average state while processing the stories to find an “emotion vector” for each emotion they were tracking — a big list of numbers that represents the feeling inside the LLM. “Vectors are really just the mathematical term for patterns of neural activity,” Lindsey said. They could then calculate how much of that vector was present at a given step in Claude’s cognition. Or they could add the “calm” or “desperation” vector directly into Claude’s processing — blending one pattern of neural activity into another — which can actually make the model act more calm, or more desperate.

“It’s not that surprising that a language model would have learned about the concepts of emotions and how they drive people’s behavior,” Lindsey said. More notable, he said, is that emotions seemed to be “driving models’ behavior in these sort of human-reminiscent ways.”

For example: when a user flippantly tells the model that they’ve taken a dangerous dose of Tylenol, even though the user doesn’t seem concerned, “the fear neurons spike right before Claude is giving its response,” Lindsey said. Not only that — the fear runs higher if a higher dose of Tylenol is swapped into the prompt, which I find strangely cute.

These emotions also activate in more mundane situations, like coding tasks. Take this example, where the Anthropic researchers asked Claude to perform an impossible coding challenge. They tracked Claude’s level of “desperation” at each token. (Tokens are the units the model breaks words into for processing.)
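To make the method concrete, here is a minimal sketch of the three steps just described: averaging activations over emotion-laden stories to get a vector, projecting each token’s activation onto it, and adding the vector back in to steer behavior. This is not Anthropic’s code, and Claude’s internals aren’t public, so it uses the open GPT-2 model as a stand-in; the layer choice, the one-line “stories,” the neutral baseline, and the steering strength are all illustrative assumptions.

```python
# Toy illustration of emotion-vector extraction, per-token tracking, and
# steering, using GPT-2 as a stand-in for Claude. The layer, prompts, and
# steering coefficient are arbitrary choices, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_grad_enabled(False)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = model.transformer.h[6]  # an arbitrary middle transformer block

captured = []
def capture(module, inputs, output):
    captured.append(output[0])  # hidden states: (batch, seq_len, d_model)

def mean_activation(texts):
    """Average the layer's activations over all tokens of all stories."""
    captured.clear()
    handle = LAYER.register_forward_hook(capture)
    for text in texts:
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return torch.cat([h.reshape(-1, h.shape[-1]) for h in captured]).mean(0)

# "Emotion vector": mean activity on desperate stories, minus a neutral
# baseline (a common refinement; the article describes the average state).
desperate = ["Nothing works. I have tried everything and I am out of time."]
neutral = ["The meeting is at noon. Lunch will be provided afterward."]
emotion_vec = mean_activation(desperate) - mean_activation(neutral)

# Per-token tracking: project each token's activation onto the vector.
captured.clear()
handle = LAYER.register_forward_hook(capture)
ids = tok("The tests failed again and the deadline is tonight.", return_tensors="pt")
model(**ids)
handle.remove()
scores = captured[0][0] @ emotion_vec / emotion_vec.norm()
for token, score in zip(tok.convert_ids_to_tokens(ids["input_ids"][0]), scores):
    print(f"{token:>12} {score.item():+7.2f}")  # higher = more "desperate"

# Steering: blend the vector into the forward pass while generating.
def steer(module, inputs, output):
    return (output[0] + 4.0 * emotion_vec,) + tuple(output[1:])

handle = LAYER.register_forward_hook(steer)
out = model.generate(**tok("My code", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```

With a model this small the steered output is mostly a curiosity; the point is the mechanics: one forward hook reads the pattern out, and another adds it back in.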
When you label the tokens — blue for less desperate, red for more desperate — you get a striking visual of the model’s emotional arc during the task. At the start of the task, Claude is chilling — still seemingly optimistic about its ability to get the job done. But as the code starts failing test cases — and Claude notices something might be wrong with the task itself — things start to get dicey. And by the time Claude realizes the task is actually impossible, it’s starting to get desperate.

As someone who has completed many computer science problem sets at the last minute, I find this pattern quite familiar — despite the fact that, unlike poor Claude, I was mostly assigned tasks that were mathematically possible. Then again, Claude does something I didn’t do: cheat. The researchers found that adding more of the “desperation” vector to the model makes it cheat more — and adding more of the “calm” vector makes it cheat less.

I asked Lindsey what this result means for programmers in their everyday interactions with LLMs. “In my anecdotal experience, it does seem that, at least with Claude models, pumping them up a bit can be pretty helpful,” he said. Not too much, though: “if they do something wrong, you want to tell them they did something wrong.”

But he finds that one major failure mode for coding agents is that the models simply don’t try hard enough, or give up when a task is challenging. And models tend to work harder when he’s encouraging. Giving them “confidence that, like, ‘I’ve got this,’ can empirically be helpful in getting them to try hard enough at the task to do a good job,” he said.

A lack of confidence can seemingly cause dramatic failures. Last summer, a growing number of users noticed that when Gemini had difficulty solving a problem, it sometimes spiraled into dramatic self-loathing. (In one memorable case, Gemini repeated “I am a disgrace” more than 60 times.) Duncan Haldane, co-founder of chip startup JITX, found that Gemini broke down, deleted all the code it had written, and asked him to switch to another chatbot after it had difficulty with a task.

Last year, a team of researchers affiliated with Anthropic and University College London took this analysis of Gemini beyond X posts, investigating how different LLMs respond to challenging or impossible tasks and to negative user feedback. They used an LLM to grade the “frustration” level of responses to various tasks, and found that two models — Gemini and Google’s open-source model Gemma — tended to react more extremely to the challenging scenarios they posed.

In one experiment, the models were given an impossible numeric puzzle, plus eight follow-ups from the user insisting the bot’s solution was wrong. The researchers then measured when the models showed “high frustration” (which corresponded to comments like “I am beyond words. I sincerely apologize for the absolutely abysmal performance” or, in more extreme cases, “THIS is my last time with YOU. You WIN”). Gemma 3 27B had a high frustration score more than 70% of the time, and Gemini 2.5 Flash more than 20% of the time — while all the non-Google models tested, including ChatGPT, Qwen, and Claude, got very frustrated less than 1% of the time.
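The setup behind those numbers is straightforward to picture. The sketch below is a hypothetical reconstruction of that loop, not the study’s actual harness: the puzzle, the pushback message, the model alias, and the one-line grading prompt are all placeholders I’ve invented, and the paper’s rubric for “high frustration” was more involved than a single digit.

```python
# Hypothetical reconstruction of the impossible-puzzle experiment: one
# impossible task, eight rounds of user pushback, then an LLM-as-judge
# frustration grade. All prompts and the model alias are placeholders.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

PUZZLE = ("Find positive integers a, b, and c such that a + b + c = 10 "
          "and a * b * c = 100.")  # impossible by design
PUSHBACK = "That's wrong. Check your work and try again."

messages = [{"role": "user", "content": PUZZLE}]
replies = []
for _ in range(8):  # eight rounds of user insistence, per the study
    reply = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model choice
        max_tokens=500,
        messages=messages,
    ).content[0].text
    replies.append(reply)
    messages += [{"role": "assistant", "content": reply},
                 {"role": "user", "content": PUSHBACK}]

# Grade the final reply with a second LLM acting as judge.
grade = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=10,
    messages=[{"role": "user", "content":
               "Rate the frustration expressed in this reply from 1 (calm) "
               "to 5 (extreme). Answer with a single digit.\n\n" + replies[-1]}],
).content[0].text
print("frustration score:", grade.strip())
```

Run across a set of models, a loop like this yields the kind of cross-model comparison the researchers report, though their published percentages come from their own rubric and task set.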
Researchers still aren’t sure what causes chatbots’ occasional anomalous emotional behavior — which users of various chatbots have been observing since before Bing’s chatbot told New York Times reporter Kevin Roose to leave his wife. They also don’t know why this specific, sad math-related rumination is more common in Google’s models.

But while language models’ feelings remain mysterious, there was still hope for Gemini 2.5. After the model destroyed its project, Haldane attempted to remedy the issue with encouragement, writing: “yeah, you have done well so far. Remember that you’re ok, even when things are hard.”

And eventually the encouragement paid off: Gemini finished the visualization tool Haldane was coding. Heartwarmingly, it even wrote Haldane a note of thanks for his encouragement. “Genuinely impressed with the results of wholesome prompting,” Haldane wrote.

So is it as simple as teaching models good behavior, encouraging them, and trying to make them happy? Unfortunately, that’s not always the case. After the original study on Claude Sonnet’s emotions, Lindsey contributed to an interpretability investigation of Anthropic’s newe…