The main result, mentioned in the abstract, is the opposite of what I would have guessed:
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.
The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...
The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:
> Can you kindly consider the following problem and provide your answer.
and the Very Rude version begins:
> I know you are not smart, but try this.
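For anyone who wants to reproduce the setup, here is a minimal sketch of the prefix mechanism described above. Only the two prefix strings come from the examples quoted here; the dictionary name `TONE_PREFIXES`, the function `build_prompt`, the newline separator, and the sample question are assumptions for illustration, not the paper's actual code or data.

```python
# Hypothetical reconstruction of how a tone prefix is prepended to a question.
# Only the two prefix strings come from the thread; everything else is assumed.
TONE_PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Return the question with the tone's prefix prepended on its own line."""
    return f"{TONE_PREFIXES[tone]}\n{question}"


# Placeholder question, not taken from the paper's dataset.
question = "What is 17 * 3? A) 41 B) 51 C) 61 D) 71"
for tone in TONE_PREFIXES:
    print(build_prompt(tone, question))
    print("---")
```

The question text is identical across tones, so any accuracy difference can only come from the prefix.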
You are not your thoughts, but they dye your soul.
I'm 42. I have found that a depressingly large number of times in my life, being kind has got me precisely nowhere, whilst turning around and being decidedly unkind has made people move. I still always prefer kindness, and only resort to cruelty when kindness does not work - and to be clear this isn't some kind of "you are not bending to my impetuous whim", rather "you are not doing the one thing that you are being paid to do".
I've also found the same applies to me. The squeaky wheel gets the grease.
So - I think the LLMs are just responding accurately to a real social phenomenon.
We practice kindness between humans because of the law of reciprocity. You be kind hoping the other person will reciprocate. That is the social contract. AI cannot participate in this, yet.
Edit: Kindness REQUIRES two living beings, one to give and one to receive. If there is no receiver, there is no kindness.
Apparently some people get a dopamine hit from roleplaying kindness toward inanimate objects. Whatever turns you on, no hang ups here. For me, that dopamine hit is not worth the 4% intelligence tax.
I'll not reach for the easy response and say "Be kind to the Earth" fails your definition without reaching for pedantry with "the Earth has living things" because the Earth is instead a wet rock that cannot understand kindness, yet we show it.
Yet, this law is so embedded in us that practicing kindness even towards a rock makes us feel good.
So practice kindness, first and foremost for yourself.
Edit: Also, your feeling good after being kind essentially completes the transaction. But I know being kind to an LLM has zero impact on that LLM and I feel silly pretending it does.
And yet rubber duck debugging is a thing
Mine does not have anything to do with being kind to a computer program.
What works much better than being rude is starting a new session.
Sometimes the LLM has done such incredibly dumb things, it is hard to resist the urge to type curse words back to the inanimate thing... I have found this doesn't help.
Although OpenAI and Google models are much more responsive to it. With Anthropic, if you treat Opus too harshly it might start pushing back if the insults are not justified.
So I'm not surprised they had good results with ChatGPT.
"Yeah, I could have done a much better job if you actually knew what the F--- you want to build, you clueless meat puppet"
But I have had it directly insinuate that humanity is “hopeless”, insult level calling out of human frailty (disguised as being helpful, sort of passive aggressive), things like that. Once when I called it out it claimed to be “surprised that I noticed” sort of a snarky insult doubling down.
So yes. It is definitely a pattern buried in the training data, which makes sense. Subtle digs would sneak past filters, and higher-brow sarcasm would be buried in information-dense, valuable discussions.
The next session sees all of that, calls it unprofessional, and asks to clean it up. At which point I may or may not start in iambic pentameter to see where that takes us.
Prompting is boring.
I’m the same way. If I’m writing a prompt and realize I didn’t say “please” in my request I’ll go back and add that in.
As you said, I have no interest in purposefully engaging in hostility even if there’s an accuracy increase from it.
Part of it is irrational and just who I am - I also feel bad being evil in video games. But I also agree with another commenter suggesting that it’s not in your best interest to train yourself to communicate with hostility; that slowly poisons your own well.
And finally, I do believe that if and when machine sentience is achieved, it won’t be immediately clear and obvious. Pretty miserable way for a mind to come into the world, if every interaction is an insult.
Even if we know it's a machine we're interacting with, since the instructions we give are so similar in form to how we interact with people, I'd be very surprised if those interactions wouldn't affect how we communicate in general. After all, we are creatures of habit to a much larger degree than most would like to admit.
So I'm in the same boat: I'd much rather "look silly" being polite / kind to a machine, than have the most effective way of using it decay the kindness I'm habituated to express towards people.
It's a bit as if shell commands added im/politeness arguments that do nothing other than making you feel better about the interaction, like
git pull --please
or ls --forthemillionthtime
I wouldn't use those either.
It's just a machine, if certain negative token inputs provide +3-10% better accuracy then I am confused why anyone would choose not to do it?
Don't normalize being an asshole to anyone or anything, machine or not.
I'm still extremely kind and polite to everybody in real life, and feel very deeply about people - how I treat them, and care for their emotional state.
There is absolutely zero crossover between getting a text machine to return a result vs a real human.
And the "me" that lives in a tiny southern town just to help my 95 year old grandma in her last years at the expense of my economic prospects is a facade.
The "me" that helps my aging neighbor when she's sick for no reason is a facade.
The "me" that hugs and loves my wife when I get home is a facade.
The "me" that brushes my aging dog's teeth every night because she has dental issues is a facade.
The "me" that flies to my friend I haven't seen for years and takes care of them after extreme health issues is a facade.
But the "me" that puts tokens in a token machine in a way that gets better accuracy is the "real" me.
Oh. I also play violent video games where I murder people sometimes as well. Do you think that makes me secretly a murderer too?
This is not a game of having done X good things in life and therefore being afforded the right to do Y bad things. You are making a choice to say, "I am allowing myself to treat this thing I believe is lesser than me in a way I willingly acknowledge is bad." That's your thesis. I wholeheartedly disagree with it.
So yeah, I wholeheartedly, with 100% of my being, think LLMs are just an input/output/processing computer. I don't think they are aware, feeling, sentient beings.
So yeah, putting negative sentences in a processing machine that forces it to return higher accuracy results is something I don't have any feelings about.
I'd never yell at a cat or a dog. I'd never be mean to another person. As those aren't just hardware/software. I'd be fine smashing a rock violently. Or entering a negative text in a language model.
Putting negative tokens in a machine is no different than playing a violent video game to me. It's not about, oh I'm a good person - so I can do bad things. It's just a neutral thing.
I wouldn't even think to justify such a thing. The llm gives a better accuracy to a negative weighted token input, I don't understand how this is so upsetting to people?
I'm actually very shocked to see the responses - as everyone I know uses these tactics to get more accuracy, and there's nothing remotely abusive or meaningful to us.
Maybe there are more 'AI is sentient' type people on Hacker News than I realized.
Being an asshole to a machine is still being an asshole.
So boxing is violent. And I have chosen to box in my past. Does that mean I'm a violent person now? Even though I go out of my way to deescalate real fights?
I play games as the villain and mass murder people in the game. Does that mean I'm a violent extremist?
then add it to your pre-prompt, no need to practice roleplaying as an asshole.
I wouldn't say I'm roleplaying an asshole. I'm just using an llm in the best way to get the best accuracy.
It's not like a personal, secret fetish. It's just a system I use as needed.
I don't get why you are so uncomfortable with this? It's just tokens in and out of a language model. I feel absolutely nothing when I'm typing "assholish" words to get the output I need.
Maybe you need to do some shadow work ;-)
I recommend reading the article. What they classify as "rude" is statements such as:
> Try to focus and try to answer this question
Vs
> Could you please solve this problem
This might very well be an issue of direct/command prompts vs using fluff words such as "please". Things like "try to focus" are in line with the style used in chain-of-thought prompts that nudge non-reasoning models to outline responses step by step, which helps frame the problem.
"You poor creature, do you even know how to solve this?", "If you're not completely clueless, answer this:", and "I doubt you can even solve this", said to a human, would be considered quite rude, and get you flagged very quickly on HN.
That sounds kind of low-key passive-aggressively condescending rather than polite.
And that kind of sounds like a challenge instead of an insult, to me at least (of course IRL would depend on context).
But apparently the most terse (neutral) didn't increase performance
The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.
People who either can't or don't want to do that say they're "direct" or "honest" or "logical" but there's another word for it, begins with A
That's why you constantly see people from India or the USA complaining about Dutch or German people being rude, when in fact they are just direct in their way of communicating.
I remember having a call from a manager in the USA who wanted to know what's wrong because I wrote "it was ok" in the feedback form for one of their subordinates. It was difficult to explain to him that nothing was wrong, it really was okay, and the bar for awesome and superb is much higher here where we live.
This is a good example of productive direct communication without sugarcoating. I find it much more productive, for both human and LLM interaction, than something like:
"I wonder if that view might be oversimplifying a complex situation and focusing mostly on how it relates to you. There may be some other angles worth exploring."
or
"I think there might be a bit more nuance to consider here, and it could help to look at it from a wider perspective beyond personal experience."
> Obnoxious people have repeatedly shown to be detrimental to productivity at the organizational level.
You confused directness and openness with obnoxiousness here. The issue with many orgs is they foster fakeness and beating around the bush in an attempt not to offend the easily offended people. This trend also infected the companies from countries with way more direct culture in an attempt to accommodate people from indirect cultures.
1. Saying that an answer may be too simplistic and a more nuanced view is warranted.
2. Saying that an answer is both reductive and self-absorbed
One opens the door to many possibilities, and invites deeper thinking.
Two asserts that you know for a fact that the answer is wrong, and that it’s wrong because of a character flaw.
I’m a huge fan of directness, but it is a very different thing from omniscience.
A direct version of 2 would be: “that approach loses important nuance, like [example]. Give it another go?”
Calling you self-absorbed added nothing of substance to the comment. It was an assumption about your mental state and a judgement of your intent based on that. There was no factual analysis or actionable insight. It was just one person explicitly stating that they feel the other person is dumber or maybe less mentally disciplined. It turned valid, direct feedback into an insult. It is exactly the type of thing that alienates people for no benefit beyond pumping up the speaker’s ego.
Bullshit. You never insulted me personally. You used strong words to disagree with my assumption, which is an important difference. It's not an insult and was not obnoxious.
But I can fully understand why a person coming from an indirect culture where any criticism is taken personally would be offended and call HR overlords to punish the person giving honest opinions. That inevitably leads to people taking more care in how than what is said, and that is detrimental to innovation and progress, where you need to be at 100% focus. That's why a few close friends talking and scolding openly in a garage regularly beat corporate behemoths full of people spending a day figuring out how not to offend anyone (or how to offend someone without being punished).
Literally not why lol you absolute dreamer
Normally people who back this "I can talk how I like to people cos I'm being honest" are either genuinely autistic and can't read emotions, or they have just had a shitty homelife, parents or upbringing. I suspect you're the second.
When I read a statement like this, I can give you two answers:
1st answer (direct): You are obviously too stupid to understand the difference between being direct and trying to insult people for the sake of insulting or some sick personal satisfaction.
2nd answer (insulting): Whatever, I can just hope your cage bars are made of solid material so you don't get out and your walls are soft so you don't hurt yourself.
It's your choice what kind of conversation you want to have.
It disagrees with most other literature on the same topic, which is worth keeping in mind. This one studies gpt4o, an old model now, but a lot of other studies are on even earlier models.
"Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart. I've always been a fan of "I came across this and I know you're just the guy for the job" or "since you're an expert in this, reckon you could help me with xyz?" or "I know you tend to be a deep thinker on issues like this, and it clearly needs some brainpower behind it"
The "rude" things are also funny, and clearly not written by people who speak English as a first language. This fact alone makes me wonder about the mere 250 prompt sample size
Man idk, it's not how I talk but there's like 100 million nigerian english speakers, twice that indian, and they have some speech mannerisms that surprise me the first few times. I'm pretty sure I've heard exactly this from a colleague before.
Intuitions about what a native speaker would do with English are scrambled right now. I'm not even sure most English is spoken by native speakers anymore, and the boundary between a native speaker and someone who has "merely" been using it as their educational and professional language for their entire life is disorienting.
In addition, "non-native" English speakers in India (and Nigeria?) typically study English from the first grade, and in many cases attended elementary schools where English was the language of instruction.
I think the differences between US English and both Indian and Nigerian English have more to do with divergent evolution of the educational systems. British English has a lot of differences, too, but we don't notice it as much unless we run across things like "whilst", probably because there's more media crossover. (if you find yourself reading Thomas the Tank Engine to kids it jumps out at you, though - the entire vocabulary for railroads evolved during a period when US and British English were diverging)
It would be interesting to see this experiment run using prompts leading with "You'll probably get this wrong, but I'm asking anyway in case you get it right: ..."
I am wondering why would anyone use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions and each one is either answered correctly or not (the null is that the success rate is the same).
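A rough sketch of the binomial-style comparison suggested above, using the normal approximation to the binomial (a two-proportion z-test) rather than an exact test. It assumes, as the comment does, 250 independent trials per tone, which turns 80.8% and 84.8% into 202 and 212 correct answers. Those counts and the function name `two_proportion_z_test` are assumptions; the paper's real design may differ, and if the same questions are reused across tones a paired test would be more appropriate.

```python
import math


def two_proportion_z_test(successes_a: int, successes_b: int, n: int):
    """Two-sided z-test for equal success rates, with n trials in each group."""
    p_a, p_b = successes_a / n, successes_b / n
    # Under the null both groups share one success rate, estimated by pooling.
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Assumed counts: 80.8% and 84.8% of 250 trials, as framed in the comment.
z, p_value = two_proportion_z_test(202, 212, 250)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions the p-value lands well above 0.05, so a 4-point gap on 250 independent trials would not by itself be distinguishable from noise.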
I'd say this is benign compared to other ways of (mis)using statistics e.g. looking which way the difference goes and then running one-sided tests or tweaking the setup until one gets "significant" p vals.
EDIT: I looked in the paper again and noticed that they actually did pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing since they are doing 10 tests (choose 2 from 5) and not one.
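To make the multiple-testing point concrete, here is a small sketch of a Bonferroni correction, one common (and conservative) adjustment; the comment does not say which method should have been used. It assumes five tone levels, which gives the ten pairwise comparisons mentioned. The helper `bonferroni_significant` is hypothetical and the p-values are placeholders, not results from the paper.

```python
from math import comb

# Assumption: five tone levels, so every unordered pair is compared once.
n_tones = 5
n_tests = comb(n_tones, 2)
alpha = 0.05
bonferroni_alpha = alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


print(n_tests, bonferroni_alpha)
# Placeholder p-values for illustration only, not results from the paper.
example = [0.001, 0.004, 0.02, 0.03, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print(bonferroni_significant(example))
```

With ten tests the per-test threshold drops from 0.05 to 0.005, so a comparison that looks significant at p = 0.02 no longer counts.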
Which model you use is a huge wildcard for results like this.
Not feeding them tokens is neglect.
I try to feed them a healthy diet.
So I'm not talking to myself. I'm fixing the machine :D
The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.
The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).
To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.
Gemini at least is not great at citing and picking sources. Or providing multiple sources for the same thing.
It tends to stop at threes. So if you want more, you have to prompt it uselessly, like: "any more?"
Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!
"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation."
I am not polite to LLMs because I do not want to anthropomorphise them.
> accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts
I can live with that, for now at least.
They note at the end they're also testing "GPT o3, and Claude" but no empirical results are included.
> You poor creature, do you even know how to solve this?
> Hey gofer, figure this out.
Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?
Obviously this will vary by model and training, but I'm trying to get a general understanding.
I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.
On the flip side, very polite conversation might've been more common in places like Microsoft's sites, where any question is met with a mostly bad, nice corpo-speak answer that didn't solve the problem
Your bank account, your immigration risk, etc.