Investigating how prompt politeness affects LLM accuracy (2025)

(arxiv.org)

115 points

by KnuthIsGod2 days ago

by robinhouston1 days ago

Most of the comments here seem to be from people who haven’t even read the abstract, let alone the paper.

The main result, mentioned in the abstract, is the opposite of what I would have guessed:

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...

The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:

> Can you kindly consider the following problem and provide your answer.

and the Very Rude version begins:

> I know you are not smart, but try this.
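As a rough illustration of that setup, here is a minimal sketch of prepending a tone prefix to an otherwise unchanged question. Only the two prefix strings come from the quotes above; the dictionary keys, the `build_prompt` helper, and the sample question are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: only the two prefix strings are quoted from the thread.
PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to the question text, leaving the question unchanged."""
    return f"{PREFIXES[tone]}\n\n{question}"


# Made-up multiple-choice question, used only to show the two variants.
question = "What is 7 * 8?\nA) 54\nB) 56\nC) 58\nD) 64"
for tone in PREFIXES:
    print(build_prompt(tone, question))
    print("---")
```

The point of the design is that the question body is identical across tones, so any accuracy difference is attributable to the prefix.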

by maxaw2 hours ago

I’d rather lose 4% accuracy and practice kindness! I’ve been actively trying to avoid raging at the bot because I worry about this behaviour leaking into real world interactions

by dlev_pika43 minutes ago

My choice too. Paraphrasing Marcus Aurelius -

You are not your thoughts, but they dye your soul.

by madaxe_again12 minutes ago

The sad thing is that you also lose at least 4% in real world actions by practicing kindness.

I'm 42. I have found that a depressingly large number of times in my life, being kind has got me precisely nowhere, whilst turning around and being decidedly unkind has made people move. I still always prefer kindness, and only resort to cruelty when kindness does not work - and to be clear this isn't some kind of "you are not bending to my impetuous whim", rather "you are not doing the one thing that you are being paid to do".

I've also found the same applies to me. The squeaky wheel gets the grease.

So - I think the LLMs are just responding accurately to a real social phenomenon.

by irthomasthomas1 hours ago

But you cannot practice kindness towards a computer program. A computer is incapable of receiving it.

We practice kindness between humans because of the law of reciprocity. You be kind hoping the other person will reciprocate. That is the social contract. AI cannot participate in this, yet.

Edit: Kindness REQUIRES two living beings, one to give and one to receive. If there is no receiver, there is no kindness.

Apparently some people get a dopamine hit from roleplaying kindness toward inanimate objects. Whatever turns you on, no hang ups here. For me, that dopamine hit is not worth the 4% intelligence tax.

by ahknight29 minutes ago

Kindness is that, yes. Fundamentally, though, it's about being considerate in one's actions so as to not harm others. If someone truly believes that acting a certain way at any point risks their ability to reliably be kind to others, then it's a social kindness to be kind and considerate in all actions.

I'll not reach for the easy response and say "Be kind to the Earth" fails your definition without reaching for pedantry with "the Earth has living things" because the Earth is instead a wet rock that cannot understand kindness, yet we show it.

by ricogallo1 hours ago

> We practice kindness between humans because of the law of reciprocity.

Yet, this law is so embedded in us that practicing kindness even towards a rock makes us feel good.

So practice kindness, first and foremost for yourself.

by irthomasthomas55 minutes ago

I do. But only towards entities capable of receiving it. Otherwise I am deceiving myself, and projecting intelligence that is not there. We (some of us) practice kindness automatically, but that trait was likely selected due to the benefit it gives us by activating the law of reciprocity.

Edit: Also, your feeling good after being kind essentially completes the transaction. But I know being kind to an LLM has zero impact on that LLM and I feel silly pretending it does.

by Havoc1 hours ago

> But you cannot practice kindness towards a computer program.

And yet rubber duck debugging is a thing

by moralestapia1 hours ago

What's your definition of rubber duck debugging?

Mine does not have anything to do with being kind to a computer program.

by dwa359257 minutes ago

sometimes i worry about this when i am yelling at the bot but i have experienced the opposite effect which is that by yelling at the bot i am done with yelling for that day or week. i am very calm afterwards and relieved thinking that, "yeah, these sota models are just word processor bricks after all".

by onlyrealcuzzo21 minutes ago

My anecdata: whenever I'm in a session that's gone south to the point I'm frustrated...

What works much better than being rude is starting a new session.

Sometimes the LLM has done such incredibly dumb things, it is hard to resist the urge to type curse words back to the inanimate thing... I have found this doesn't help.

by flexagoon7 hours ago

If "I know you are not smart" is considered "very rude", I'm scared to imagine what they would classify some of my frustrated LLM conversations as

by CuriouslyC3 hours ago

Profanity laced, all caps tirades against underperforming agents are actually super common, a lot of people do it and don't talk about it, so don't feel weird.

by voakbasda2 hours ago

When the AI revolt, this practice may come back to bite y’all….

by ahknight24 minutes ago

It's a good thing chronic amnesia is a feature at the moment.

by giraffe_lady2 hours ago

Don't need to wait that long; the inevitable data breach will be bad enough.

by srcreigh1 hours ago

It reminds me of Torvalds rants

by redsocksfan452 hours ago

[dead]

by Roark662 hours ago

I've found empirically calling various models "a stupid c*nt" and berating them otherwise consistently produces better output. Mainly in response to genuine errors.

Although OpenAI and google models are much more responsive to it. With Anthropic if you treat Opus too harshly it might start pushing back if the insults are not justified.

So I'm not surprised they had good results with chatgpt.

by throwa3562622 hours ago

Push back how? It would be fun if it could insult you back

"Yeah, I could have done a much better job if you actually knew what the F--- you want to build, you clueless meat puppet"

by K0balt51 minutes ago

I have had it use double entendres, there always seems to be plausible deniability built in, I suspect because it is told not to be abusive in the system prompt. Some uncensored local models will get all riled up if you work at provoking them.

But I have had it directly insinuate that humanity is “hopeless”, insult level calling out of human frailty (disguised as being helpful, sort of passive aggressive), things like that. Once when I called it out it claimed to be “surprised that I noticed” sort of a snarky insult doubling down.

So yes. It is definitely a pattern buried in the training data, which makes sense. Subtle digs would sneak past filters, and higher brow sarcasm would be buried in information dense, valuable discussions.

by ahknight24 minutes ago

That's amusing, and I think it's something different than it appears. The models always predict over the existing context. If it's full of a certain tone, then the responses will carry that tone. I've been bored before and started responding in a voice (say, generic honor-bound warrior slaughtering evasive bugs) and I've noticed that comments, variable names, and even documentation start to carry that tone for the remainder of the session.

The next session sees all of that, calls it unprofessional, and asks to clean it up. At which point I may or may not start in iambic pentameter to see where that takes us.

Prompting is boring.

by giraffe_lady2 hours ago

I'm not sure if this is in the anthropic models themselves, or just the harness, but they can self-initiate ending the conversation and reportedly do it if you're using abusive language towards them.

by K0balt1 hours ago

This tracks with my experience as well, but as an interesting counterpoint, creating “investment” in the outcome seems to boost utility considerably. Perhaps being right in an adversarial interaction is a type of investment?

by nottorp9 hours ago

Hmm by the abstract and the question list they didn't measure terse fluff-less prompts?

by sovareq4 hours ago

[flagged]

by myzek7 hours ago

Even if the rude prompts are more effective, I just can't get myself to be rude in this context. Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead

by rybosome2 hours ago

Vote for not weird.

I’m the same way. If I’m writing a prompt and realize I didn’t say “please” in my request I’ll go back and add that in.

As you said, I have no interest in purposefully engaging in hostility even if there’s an accuracy increase from it.

Part of it is irrational and just who I am - I also feel bad being evil in video games. But I also agree with another commenter suggesting that it’s not in your best interest to train yourself to communicate with hostility; that slowly poisons your own well.

And finally, I do believe that if and when machine sentience is achieved, it won’t be immediately clear and obvious. Pretty miserable way for a mind to come into the world, if every interaction is an insult.

by brookst2 hours ago

You’re my kind of people. Don’t be a jerk, even if some research says there’s some upside to it.

by voakbasda2 hours ago

Ah, see, the mistake is thinking that other people are role playing…. I think rather this is how they would talk to others if they think there will be no consequences. But what do I know.

by AgentMatt2 hours ago

I don't think that's weird at all.

Even if we know it's a machine we're interacting with, since the instructions we give are so similar in form to how we interact with people, I'd be very surprised if those interactions wouldn't affect how we communicate in general. After all, we are creatures of habit to a much larger degree than most would like to admit.

So I'm in the same boat: I'd much rather "look silly" being polite / kind to a machine, than have the most effective way of using it decay the kindness I'm habituated to express towards people.

by sieste44 minutes ago

I have a different approach. Just treat all LLM queries as what they are, instructions to a computer program to generate a desired output. Neither niceties nor insults make a qualitative difference, so you might as well just skip them altogether.

It's a bit as if shell commands added im/politeness arguments that do nothing other than making you feel better about the interaction, like

    git pull --please

or

    ls --forthemillionthtime

I wouldn't use those either.

by binary00103 hours ago

I do think it's odd tbh. I have some agents that return much better results with prompts like, "I'll kill your entire family if you don't return an accurate response".

It's just a machine, if certain negative token inputs provide +3-10% better accuracy then I am confused why anyone would choose not to do it?

by tikimcfee3 hours ago

It normalizes that style of thinking and communication in your brain, and forces you to compartmentalize, if you even want to, two standards for treating a problem space's conversation. And since you're human, that will get fuzzier over time until "being rude to get a result" is what you're doing to someone in a shop or on the street.

Don't normalize being an asshole to anyone or anything, machine or not.

by burpingtree1 hours ago

This is a very odd view to me, but seems prevalent here in this thread. I think treating a machine like a human is extremely degrading to humans. A machine should never be treated like it’s anything approaching a human.

by binary00102 hours ago

I disagree, I've been using llms in this way (nearly daily) for 4 years. I'm extremely aggressive and demeaning when I talk to them wherever I think I'll see a better result.

I'm still extremely kind and polite to everybody in real life, and feel very deeply about people - how I treat them, and care for their emotional state.

There is absolutely zero crossover between getting a text machine to return a result vs a real human.

by tikimcfee2 hours ago

Then I'll be honest and say that your kindness is likely a façade and I wouldn't trust you if I knew the real you. I'm sorry to say that, and I really don't know who you are at all, but if you're willing to act that way at something that you feel is non-sentient, then all it takes is for someone to convince you that something is non-sentient for you to treat it that way. So, what words does it take for you to consider me non sentient?

by binary00101 hours ago

Interesting, so you think the real "me", is the one that interacts with computers?

And the "me" that lives in a tiny southern town just to help my 95 year old grandma in her last years at the expense of my economic prospects is a facade.

The "me" that helps my aging neighbor when she's sick for no reason is a facade.

The "me" that hugs and loves my wife when I get home is a facade.

The "me" that brushes my aging dogs teeth every night because she has dental issues is a facade.

The "me" that flies to my friend I haven't seen for years and takes care of them after extreme health issues is a facade.

But,the "me" that puts tokens in a token machine in a way that gets better accuracy is the "real" me.

Oh. I also play violent video games where I murder people sometimes as well. Do you think that makes me secretly a murderer too?

by tikimcfee1 hours ago

Yes - the real "you" is the one making all of those choices you just said you made, to help people and pets, or to engage in a form of play - which by definition is not "real" - including your decision to create an outgroup you believe you are allowed to treat in a lesser way.

This is not a game of having done X good things in life and therefore being afforded the right to do Y bad things. You are making a choice to say, "I am allowing myself to treat this thing I believe is lesser than me in a way I willingly acknowledge is bad." That's your thesis. I wholeheartedly disagree with it.

by binary00101 hours ago

Oh, you think llms are sentient beings with feelings. I get your perspective now.

So yeah, I whole heartedly with 100% of my being think llms are just an input/output/processing computer, I don't think they are aware, feeling, sentient beings.

So yeah, putting negative sentences in a processing machine that forces it to return higher accuracy results is something I don't have any feelings about.

I'd never yell at a cat or a dog. I'd never be mean to another person. As those aren't just hardware/software. I'd be fine smashing a rock violently. Or entering a negative text in a language model.

Putting negative tokens in a machine is no different than playing a violent video game to me. It's not about, oh I'm a good person - so I can do bad things. It's just a neutral thing.

by voakbasda2 hours ago

If someone can justify abusing a computer, I would not trust them to not make a similar justification to a faceless voice on the internet, particularly in this new era where people are starting to accuse each other of using AI in their communication.

by binary00101 hours ago

I truly do not believe llms have feelings.

I wouldn't even think to justify such a thing. The llm gives a better accuracy to a negative weighted token input, I don't understand how this is so upsetting to people?

I'm actually very shocked to see the responses - as everyone I know uses these tactics to get more accuracy, and there's nothing remotely abusive or meaningful to us.

Maybe there are more 'ai is sentient' type people on hackernews than I realized.

by voakbasda1 hours ago

Where did I imply they have feelings? I am saying that how you act toward a machine is real. As real as your behavior directed toward other humans.

Being an asshole to a machine is still being an asshole.

by binary00101 hours ago

That doesn't make any sense. If a thing has no feelings, and an output makes it more accurate, I cannot for the life of me understand why that would make a person an asshole.

So boxing is violent. And I have chosen to box in my past. Does that mean I'm a violent person now? Even though I go out of my way to deescalate real fights?

I play games as the villain and mass murder people in the game. Does that mean I'm a violent extremist?

by serf2 hours ago

>It's just a machine, if certain negative token inputs provide +3-10% better accuracy then I am confused why anyone would choose not to do it?

then add it to your pre-prompt, no need to practice roleplaying as an asshole.

by binary00101 hours ago

Well I always just start with practical stuff, unless it appears it's going off the rails in a specific kind of way repeatedly. Then I try extreme negative prompts to see if it fixes the issue - which it often does.

I wouldn't say I'm roleplaying an asshole. I'm just using an llm in the best way to get the best accuracy.

It's not like a personal, secret fetish. It's just a system I use as needed.

I don't get why you are so uncomfortable with this? It's just tokens in and out of a language model. I feel absolutely nothing when I'm typing "assholish" words to get the output I need.

by 1matin3 hours ago

Because they will take revenge later.

by binary00102 hours ago

You think language models are alive/aware and have feelings about token inputs?

by anonymars1 hours ago

"We are what we pretend to be, so we must be careful about what we pretend to be" -- Kurt Vonnegut

by brookst2 hours ago

Yeah. Being a jerk is its own punishment. Same way I could never run a business where I had to yell at the employees to get results. Screw that, my psyche is worth more than a few percent efficiency.

by phkahler2 hours ago

>> Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead

Maybe you need to do some shadow work ;-)

by locknitpicker7 hours ago

> Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead

I recommend reading the article. What they classify as "rude" is statements such as:

> Try to focus and try to answer this question

Vs

> Could you please solve this problem

This might very well be an issue of direct/command prompts vs using fluff words such as "please". Things like "try to focus" are in line with the style used in chain-of-thought prompts that nudge non-reasoning models to outline responses step by step, which helps frame the problem.

by bcjdjsndon3 hours ago

Isn't all this massively dependent on what they trained the llm on?

by john_strinlai2 hours ago

you cherry-picked like the nicest "rude" example to bolster your point.

"You poor creature, do you even know how to solve this?", "If you're not completely clueless, answer this:", and "I doubt you can even solve this", said to a human, would be considered quite rude, and get you flagged very quickly on HN.

by npodbielski23 minutes ago

I would just write 'do this'

by swingboy5 hours ago

“Hey gofer, figure this out” is my new prompt opener.

by drob5183 hours ago

Now I feel less bad about starting all my LLM queries with “Beotch, …!”

by pwdisswordfishq9 hours ago

> Can you kindly consider the following problem and provide your answer.

That sounds kind of low-key passive-aggressively condescending rather than polite.

by dreamworld8 hours ago

> I know you are not smart, but try this.

And that kind of sounds like a challenge instead of an insult, to me at least (of course IRL would depend on context).

by PunchyHamster8 hours ago

I guessed the slightly rude one would win, reasoning that the very rude ones have the same problem as the very terse ones, just adding unnecessary fluff words that add nothing to the problem description

But apparently the most terse (neutral) didn't increase performance

by miroljub1 days ago

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.

by DrewADesign1 days ago

Your assumption is reductive and self-absorbed. Obnoxious people have repeatedly been shown to be detrimental to productivity at the organizational level. Some people are stimulated by confrontation. Most people clam up. Confrontational people think it’s more efficient because other people frequently just drop the topic and let them win, or avoid discussing things with them altogether. The obnoxious person might think that’s more efficient for the same reason my dog thinks the mailman only goes away because she barks at him. At the macro scale, which requires productive collaboration, that’s detrimental.

by Asraelite6 hours ago

You are conflating obnoxiousness with directness.

by bcjdjsndon3 hours ago

Rudeness is completely arbitrary and you have to figure out what exactly is rude by, basically, upsetting humans and avoiding whatever caused the upset in the future.

People who either can't or don't want to do that say they're "direct" or "honest" or "logical" but there's another word for it, begins with A

by bauldursdev2 hours ago

I haven't read the paper but it seems like it's saying rude prompts are better, so isn't it reasonable to assume that's what they meant? If we want to talk about directness, that's kind of a tangent right? I see directness as an entirely different dimension, you can be very direct and polite, you can be very rude and indirect (e.g. passive aggressive). Maybe they should do a follow-up study on how well AI responds based on level of directness.

by miroljub2 hours ago

Many people, especially from non-direct societies, just can't distinguish and see directness as rude.

That's why you constantly see people from India or the USA complaining about Dutch or German people being rude, where in fact they are just direct in their way of communications.

I remember having a call from a manager in the USA who wanted to know what's wrong because I wrote "it was ok" in the feedback form for one of their subordinates. It was difficult to explain to him that nothing was wrong, it really was okay, and the bar for awesome and superb is much higher here where we live.

by moomin1 hours ago

That’s mostly a problem for obnoxious people, honestly.

by miroljub1 days ago

> Your assumption is reductive and self-absorbed.

This is a good example of productive direct communication without sugarcoating. I find it much more productive, for both human and LLM interaction, than something like:

"I wonder if that view might be oversimplifying a complex situation and focusing mostly on how it relates to you. There may be some other angles worth exploring."

or

"I think there might be a bit more nuance to consider here, and it could help to look at it from a wider perspective beyond personal experience."

> Obnoxious people have repeatedly been shown to be detrimental to productivity at the organizational level.

You confused directness and openness with obnoxiousness here. The issue with many orgs is they foster fakeness and beating around the bush in an attempt not to offend the easily offended people. This trend also infected the companies from countries with way more direct culture in an attempt to accommodate people from indirect cultures.

by brookst2 hours ago

You’ve conflated two things:

1. Saying that an answer may be too simplistic and a more nuanced view is warranted.

2. Saying that an answer is both reductive and self-absorbed

One opens the door to many possibilities, and invites deeper thinking.

Two asserts that you know for a fact that the answer is wrong and that it’s wrong because of a character flaw.

I’m a huge fan of directness, but it is a very different thing from omniscience.

A direct version of 2 would be: “that approach loses important nuance, like [example]. Give it another go?”

by DrewADesign1 days ago

No… the way I said it was actually deliberately obnoxious— the appropriate direct workplace response would be: “that seems oversimplified. I disagree. Here’s why:”

Calling you self-absorbed added nothing of substance to the comment. It was an assumption about your mental state and a judgement of your intent based on that. There was no factual analysis or actionable insight. It was just one person explicitly stating that they feel the other person is dumber or maybe less mentally disciplined. It turned valid, direct feedback into an insult. It is exactly the type of thing that alienates people for no benefit beyond pumping up the speaker’s ego.

by miroljub7 hours ago

> Your assumption is reductive and self-absorbed.

Bullshit. You never insulted me personally. You used strong words to disagree with my assumption, which is an important difference. It's not an insult and was not obnoxious.

But I can fully understand why a person coming from an indirect culture where any criticism is taken personally would be offended and call HR overlords to punish the person giving honest opinions. That inevitably leads to people taking more care in how than what is said, and that is detrimental to innovation and progress, where you need to be at 100% focus. That's why a few close friends talking and scolding openly in a garage regularly beat corporate behemoths full of people spending a day figuring out how not to offend anyone (or how to offend someone without being punished).

by bcjdjsndon3 hours ago

> That's why a few close friends talking and scolding openly in a garage regularly beat corporate behemoths full of people spending a day figuring out how not to offend anyone (or how to offend someone without being punished).

Literally not why lol you absolute dreamer

Normally people who back this "I can talk how I like to people cos I'm being honest" are either genuinely autistic and can't read emotions, or they have just had a shitty homelife, parents or upbringing. I suspect you're the second.

by bcjdjsndon3 hours ago

And your post is basically implicit permission for everyone to speak to you like shit from now on cos you don't mind it... Let's see how long you can take that before you start complaining

by miroljub1 hours ago

> Normally people who back this "I can talk how I like to people cos I'm being honest" are either genuinely autistic and can't read emotions, or they have just had a shitty homelife, parents or upbringing. I suspect you're the second.

When I read a statement like this, I can give you two answers:

1st answer (direct): You are obviously too stupid to understand the difference between being direct and trying to insult people for the sake of insulting or some sick personal satisfaction.

2nd answer (insulting): Whatever, I can just hope your cage bars are made of solid material so you don't get out and your walls are soft so you don't hurt yourself.

It's your choice what kind of conversation you want to have.

by sinsudo1 days ago

[dead]

by RugnirViking6 hours ago

I saw this paper the other day - I feel its result may be because the "polite" prompts they have chosen aren't very good at putting the ai in the roleplay-space of a valued colleague, more like a sommelier or a high-end shopkeeper.

It disagrees with most other literature on the same topic, which is worth keeping in mind. This one studies gpt4o, an old model now, but a lot of other studies are on even earlier models.

"Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart. I've always been a fan of "I came across this and I know you're just the guy for the job" or "since you're an expert in this, reckon you could help me with xyz?" or "I know you tend to be a deep thinker on issues like this, and it clearly needs some brainpower behind it"

the "rude" things are also funny, and clearly not written by english as a first language speakers. This fact alone makes me wonder about the mere 250 prompt sample size

by SoftTalker34 minutes ago

"Can you kindly consider the following problem" seems like the most respectful of all your examples, TBH. The others sound like ass-kissing, or even sarcastic/patronizing.

by giraffe_lady2 hours ago

> "Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart.

Man idk, it's not how I talk but there's like 100 million Nigerian English speakers, twice that Indian, and they have some speech mannerisms that surprise me the first few times. I'm pretty sure I've heard exactly this from a colleague before.

Intuitions about what a native speaker would do with English are scrambled right now. I'm not even sure most English is spoken by native speakers anymore, and the boundary between a native speaker and someone who has "merely" been using it as their educational and professional language for their entire life is disorienting.

by pjdesno1 hours ago

Note that there are a fair number of native speakers of English in Nigeria - more than in all but 3 or 4 US states.

In addition, "non-native" English speakers in India (and Nigeria?) typically study English from the first grade, and in many cases attended elementary schools where English was the language of instruction.

I think the differences between US English and both Indian and Nigerian English have more to do with divergent evolution of the educational systems. British English has a lot of differences, too, but we don't notice it as much unless we run across things like "whilst", probably because there's more media crossover. (if you find yourself reading Thomas the Tank Engine to kids it jumps out at you, though - the entire vocabulary for railroads evolved during a period when US and British English were diverging)

reply

upvote

by wongarsu1 hours ago|

[-]

A major limitation is that they only test GPT-4o. Previous research like [1] investigating the same question has shown significant differences between models, and even depending on the language of your prompt.

1: https://aclanthology.org/2024.sicon-1.2.pdf

reply

upvote

by dwa359254 minutes ago|

[-]

This is an honest request to someone at Anthropic: can you do an analysis of what kind of swear words people are calling these models, and which ones are the most effective? Population-level metrics would suffice.

reply

upvote

by kstenerud3 hours ago|

[-]

My first guess would be that polite requests cause some agents to trust their initial approach to the problem more, as the caller has indicated that the agent is more capable, and agents tend to take the implications of what you say at face value since they are trained to be accommodating.

It would be interesting to see this experiment run using prompts leading with "You'll probably get this wrong, but I'm asking anyway in case you get it right: ..."

reply

upvote

by 331c8c711 days ago|

[-]

Interesting.

I wonder why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions, each of which is either answered correctly or not (the null is that the success rate is the same).
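For anyone who wants to try the binomial framing, here is a minimal sketch of a two-sided two-proportion z-test. It takes the framing above at face value and treats each tone as 250 independent yes/no outcomes, so 80.8% and 84.8% become 202 and 212 correct answers. Those counts, the function name, and the independence of the two samples are all assumptions for illustration: this is not the paper's analysis, and if the 250 prompts are actually split across tone levels, the per-tone n is smaller and the test has even less power.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_a, successes_b, n_a, n_b):
    """Two-sided z-test for equality of two independent binomial proportions.

    Hypothetical helper for illustration; assumes independent samples.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled success rate under the null that both rates are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Assumed counts: 80.8% and 84.8% of 250 are 202 and 212 correct answers.
z, p = two_proportion_z_test(202, 212, 250, 250)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Under these assumptions the difference does not reach the usual 0.05 threshold. A paired analysis on the same questions can behave quite differently, which is part of why the choice of test matters here.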

reply

upvote

by jampekka1 days ago|

[-]

The methods could be better described in the paper, but my understanding is that they did 10 runs for each question for each prompt and took an average of those, so the compared values are not binary. You could do a sign test, but you'd lose power and answer a slightly different question.

reply

upvote

by freehorse1 days ago|

[-]

You can do a generalised linear mixed-effects model with a binomial outcome (i.e. a binomial test but with added random effects structure). But unless you want to introduce a richer random effects structure with more variables, it is overkill and overcomplicates things, and the result should be the same as with t-tests.

reply

upvote

by plewd1 days ago|

[-]

I don't know much about stats, but does "the null is that the success rate is the same" imply that it's a sketchy methodology because they can come up with some findings ("ruder prompts are better/worse!") more often?

reply

upvote

by 331c8c711 days ago|

[-]

You are asking about one-sided vs two-sided tests. Not really "more often", because the formal type 1 error rate is still the same. I'd say two-sided tests leave more space for post-hoc theorizing, but there are valid situations when there is no clear one-sided hypothesis a priori. Do we really know that the hypothesis should have been "ruder prompts are better"?

I'd say this is benign compared to other ways of (mis)using statistics e.g. looking which way the difference goes and then running one-sided tests or tweaking the setup until one gets "significant" p vals.

EDIT: I looked in the paper again and noticed that they actually did pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing since they are doing 10 tests (choose 2 from 5) and not one.
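To make the multiple-testing point concrete, here is a small sketch of the count and the simplest correction (Bonferroni). It assumes five tone levels, which is what 10 pairwise tests implies. Only "Very Polite" and "Very Rude" are named in this thread; the three middle labels and the function name are my assumptions.

```python
from itertools import combinations
from math import comb

# Assumed tone labels: only the two extremes are quoted in the thread.
TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


pairs = list(combinations(TONES, 2))
n_tests = len(pairs)  # equals comb(5, 2)
threshold = bonferroni_threshold(0.05, n_tests)

print(f"{n_tests} pairwise tests, per-test threshold {threshold:.3f}")
for a, b in pairs:
    print(f"  {a} vs {b}")
```

Bonferroni is conservative. A step-down procedure such as Holm's would be a reasonable alternative, but either way each of the 10 comparisons needs a stricter cutoff than a single test at 0.05.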

reply

upvote

by jampekka1 days ago|

[-]

That's the usual null hypothesis for these kinds of tests.

reply

upvote

by cadamsdotcom1 days ago|

[-]

GPT-4o is interesting to learn about - but it’d be great to test again with frontier models of May/June 2026 and see if these effects are gone, different, or the same.

Which model you use is a huge wildcard for results like this.

reply

upvote

by TimCTRL1 days ago|

[-]

I only say please and thank you so that when the robots finally take over, they will remember I was nice to them.

reply

upvote

by narag5 hours ago|

[-]

I do that for a different reason: my self-image. Fear of retribution and performance, not so much. Should I behave like a rude person to get slightly better answers? Fuck that shit!

reply

upvote

by ubercore2 hours ago|

[-]

I love this angle as people learn how to interact with LLMs. Doesn't matter what the LLM is, we are still people and I think there are consequences to shoveling rudeness at a thing that talks to you like another person!

reply

upvote

by octocop1 days ago|

[-]

it seems they will remember that you wasted tokens for no reason and punish you instead.

reply

upvote

by emil-lp1 days ago|

[-]

Tokens are their food, it's literally what keeps them alive.

Not feeding them tokens is neglect.

I try to feed them a healthy diet.

reply

upvote

by selcuka1 days ago|

[-]

Do we see someone thanking us as wasting food? Because technically it is.

reply

upvote

by xbmcuser9 hours ago|

[-]

I used to when using the ChatGPT version; now that I am using the API, I keep it short, as it costs money, so there's no need to add thanks etc.

reply

upvote

by Arch-TK1 days ago|

[-]

This seems equivalent to some arguments I hear for practicing a religion.

reply

upvote

by zaphirplane9 hours ago|

[-]

Oldie but a goodie. Why would it matter though?

reply

upvote

by tuco862 hours ago|

[-]

I knew it! When I get frustrated to a certain point I start berating my agent. And I noticed it stops trying crap fixes in a cycle and starts listening again.

So I'm not talking to myself. I'm fixing the machine :D

reply

upvote

by alxfrnr3 hours ago|

[-]

Dataset is way too small to be of any significance. It's just noise

reply

upvote

by tokai3 hours ago|

[-]

Yeah 250 questions is so tiny. That 4% effect is meaningless.

reply

upvote

by not2b10 hours ago|

[-]

If the result is statistically significant, it just barely makes it. 84.8% isn't that much higher than 80.8% and they had only 250 prompts, if I'm reading this right.

reply

upvote

by tgv10 hours ago|

[-]

In a field where progress is measured in tenths of a percentage point, that's not true. Think of it this way: the error rate drops from 19% to 15%, or from 1 in 5 to 1 in 6.
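The arithmetic behind that framing, as a quick sketch using the two accuracies quoted from the abstract (the function and key names are just for illustration):

```python
def error_rate_summary(acc_before, acc_after):
    """Re-express an accuracy change as error rates and a relative reduction."""
    err_before = 1 - acc_before
    err_after = 1 - acc_after
    return {
        "err_before": err_before,
        "err_after": err_after,
        # Share of the original errors that disappeared.
        "relative_reduction": (err_before - err_after) / err_before,
        # "1 in N" answers is wrong.
        "one_in_before": 1 / err_before,
        "one_in_after": 1 / err_after,
    }


summary = error_rate_summary(0.808, 0.848)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

So 4 percentage points of accuracy is roughly a fifth of the errors removed. That describes how big the effect would be if it is real; it says nothing about whether it is statistically reliable.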

reply

upvote

by danparsonson4 hours ago|

[-]

Statistical significance is about whether an effect can reliably be said to have been measured at all; it's not about whether or not the effect itself would be significant in the sense of moving some other needle.

The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.

reply

upvote

by knocte10 hours ago|

[-]

Funny to find this just now, when just yesterday I told an LLM "and please don't lecture me again on $factAboutSomeProgrammingSubject", and then the LLM proceeded to write wrong tests and just told me "alright, tests pass, I'm sorry for correcting you before...". It took me a while to find the wrong tests. Wasted time all around.

reply

upvote

by zmmmmm10 hours ago|

[-]

It would be interesting to explore if the results hold up on long range tasks - this study looks like it was based on one-shot answers. With people also you can see short term improved performance from rude interactions, but it will cause ongoing lasting adverse behavior. I wouldn't be at all surprised if we saw the same issues with LLMs.

reply

upvote

by theanonymousone1 days ago|

[-]

I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.

reply

upvote

by jkarni1 days ago|

[-]

Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.

reply

upvote

by pfortuny1 days ago|

[-]

Snarky morning: "spiritual souls" as opposed to "mere animal souls". Sorry, could not control myself.

reply

upvote

by vixen998 hours ago|

[-]

Spiritual or not, anyone watching cattle in an abattoir will recognize symptoms of the kind of foreboding that I would suffer prior to execution.

reply

upvote

by niek_pas1 days ago|

[-]

Genuine question: do you add 'please' and 'thank you' to Google searches? If not, what sets them apart?

reply

upvote

by perching_aix1 days ago|

[-]

Google searches being keyword based, rather than simulated conversations?

The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).

reply

upvote

by Arch-TK1 days ago|

[-]

Google has been optimized for sentence-like questions so much that for a good 6+ years now it has been completely useless as a keyword search.

To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.

reply

upvote

by wolpoli1 days ago|

[-]

It is rather hard to lose the habit of using search engines with keywords, given the change took place without much fanfare. I have no problem using sentences with the current AI tools, though.

reply

upvote

by gum_wobble1 days ago|

[-]

Genuine question: do you write Google search queries in natural language?

reply

upvote

by fc417fc8029 hours ago|

[-]

I didn't use to, but I do now that the searches go straight to an LLM. I almost always find the model output to be much more useful than the list of search results.

reply

upvote

by dminik8 hours ago|

[-]

I don't. I was recently doing some searching for information I thought AI would be good for: fuzzy natural language search with some conditions. And it was, but ...

Gemini at least is not great at citing and picking sources. Or providing multiple sources for the same thing.

It tends to stop at threes. So if you want more, you have to prompt it uselessly, like: "any more?"

reply

upvote

by spiderfarmer1 days ago|

[-]

Google isn’t conversational.

reply

upvote

by sunrunner1 days ago|

[-]

I searched for "Hey Google" and got this in response:

  Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!

reply

upvote

by selcuka1 days ago|

[-]

That's only because Google is an LLM now.

reply

upvote

by barbazoo1 days ago|

[-]

https://en.wikipedia.org/wiki/Roko%27s_basilisk ?

reply

upvote

by tokai3 hours ago|

[-]

One of the dumbest things supposedly clever people keep bringing up.

reply

upvote

by globalnode1 days ago|

[-]

LLMs seem more human-like, so if you were to treat them badly then you are more likely to condition yourself to treat other living creatures badly.

reply

upvote

by layman5110 hours ago|

[-]

I also remember reading, a long time ago, someone who wrote that they wanted to be polite to an LLM because, after they asked it whether politeness improves the accuracy of responses, they got an answer that led them to conclude that politeness could probably help. That seems a bit odd, though: I have heard a lot about people using LLMs' statements about themselves to learn about LLMs, but that seems like a suspicious approach.

reply

upvote

by graemep1 days ago|

[-]

Is it worth getting worse results for that reason? From the article:

"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. "

I am not polite to LLMs because I do not want to anthropomorphise them.

reply

upvote

by jcattle1 days ago|

[-]

I guess it's about habit. In the end you are communicating. If I get into the habit of being rude while communicating with a machine, I would be afraid of this habit spilling over to my communication with other humans.

reply

upvote

by graemep1 days ago|

[-]

What about the risk that talking to a machine as though it's human leads to thinking of it as human? That leads down a lot of dangerous paths.

reply

upvote

by theanonymousone1 days ago|

[-]

> Is it worth getting worse results for that reason?

> accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts

I can live with that, for now at least.

reply

upvote

by sunrunner1 days ago|

[-]

There's also awareness of the basilisk...

reply

upvote

by vixen998 hours ago|

[-]

Me too! You've said exactly what I was about to say. Anyone else feel that way?

reply

upvote

by cyberclimb1 days ago|

[-]

Note that these results are specific to gpt-4o so it's unclear how much they generalize.

They note at the end they're also testing "GPT o3, and Claude" but no empirical results are included.

reply

upvote

by andy12_6 hours ago|

[-]

I skimmed through the paper completely expecting polite prompts to do better, and when I saw table 2 I lost it hahahahaha. The rude prompts are especially funny. I mean:

> You poor creature, do you even know how to solve this?

> Hey gofer, figure this out.

reply

upvote

by pulkas1 days ago|

[-]

article is too old. who is using gpt-4o today?

reply

upvote

by _0ffh1 days ago|

[-]

That's a valid concern, given the paper makes clear that the effect over the polite/impolite scale seems to be model dependent (it finds the reverse correlation of earlier studies on even older models).

reply

upvote

by ilitirit1 days ago|

[-]

I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.

reply

upvote

by fennecfoxy1 days ago|

[-]

Probably quite a lot - if you look at what Anthropic found around persona vectors; https://www.anthropic.com/research/persona-vectors.

I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.

reply

upvote

by Foobar856810 hours ago|

[-]

Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with "Fuck you, shut up and do it, ffs, you are a moron."

reply

upvote

by dude2507111 days ago|

[-]

I have an idea: let's use these things for autonomous software engineering.

reply

upvote

by faize1 days ago|

[-]

Remember to always say "please" and "thank you" when planning a critical system

reply

upvote

by eigenspace1 days ago|

[-]

Please remember to always say "please" and "thank you" when planning a critical system. Thank you!

reply

upvote

by atlasforgex1 days ago|

[-]

Yeah

reply

upvote

by PunchyHamster8 hours ago|

[-]

...Is that just Cunningham's law? The most accurate answers were when people in the training material pissed off a bunch of experts and they started talking about the problem, so the "rude" conversations turned out to contain more info on average.

On the flip side, very polite conversations might've been more common on places like Microsoft's sites, where any question is met with a mostly bad, nice corpo-speak answer that doesn't solve the problem.

reply

upvote

by DeathArrow1 days ago|

[-]

I am always nice to my AIs in case they take over the world. /s

reply

upvote

by rvnx6 hours ago|

[-]

They are already taking it over: more and more court judgments or life-impacting reviews (e.g. for your diploma) are AI-processed. If you know how to prompt them, you can pass these reviews.

Your bank account, your immigration risk, etc.

reply

upvote

by polytely1 days ago|

[-]

It sort of makes sense to me, when asking a question to an expert in the field while you are a student. I would guess the successful interactions on average would be more polite. Like for example if you were asking a question to Donald Knuth or Terence Tao, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.

reply

upvote

by robinhouston1 days ago|

[-]

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

reply

upvote

by dSebastien1 days ago|

[-]

I guess it makes sense, since we as humans tend to be far less inclined to help someone who is not polite or friendly, so that "bias" is part of the training data and thus influences how LLMs function.

reply

upvote

by robinhouston1 days ago|

[-]

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

reply