by TeMPOraL 11 hours ago

by jstummbillig 10 hours ago

What do you mean? The page explicitly states:

> cutting ~75% of tokens while keeping full technical accuracy.

I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.

An explanation that explains nothing is not very interesting.

by prodigycorp 9 hours ago

The burden of proof is on the author to provide at least one type of eval for making that claim.

by jstummbillig 9 hours ago

I notice that the number of people confidently talking about "burden of proof" and whose it allegedly is in the context of AI has gone up sharply.

Nobody has to prove anything. It can give your claim credibility. If you don't provide any, an opposing claim without proof does not get any better.

by prodigycorp 9 hours ago

Sorry, I don't know how engaging in this could lead to anything productive. There's already literature out there that gives credence to TeMPOraL's claim. And, after a certain point, gravity being the reason that things fall becomes so self-evident that restating it does not require proof.

by xgulfie 8 hours ago

LLM quirks are not something all humans have been experiencing for thousands of years

by jmye 6 hours ago

> Nobody has to prove anything. It can give your claim credibility

“I don’t need to provide proof to say things” is a valueless, trivial assertion that adds no value whatsoever to any discussion anyone has ever had.

If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.

by systoll 9 hours ago

The author pretended they addressed the obvious criticism.

You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.

by getpokedagain 9 hours ago

In the age of vibe coding, and given that we are literally talking about a single markdown file, I am sure this has been well tested and achieves all of its goals with statistical accuracy, no side effects, and no issues.

by samusiam 8 hours ago

> I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.

But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.

by dTal 6 hours ago

Yeah but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to generate a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking, it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have made. They aren't steganographically hiding useful computation state in words like "the" and "and".

by krackers 2 hours ago

>They aren't steganographically hiding useful computation state in words like "the" and "and".

When producing a token the model doesn't just emit the final token but you also have the entire hidden states from previous attention blocks. These hidden states are mixed into the attention block of future tokens (so even though LLMs are autoregressive where a token attends to previous tokens, in terms of a computational graph this means that the hidden states of previous tokens are passed forward and used to compute hidden states of future tokens).

So no, it's not wasteful: those low-perplexity tokens are precisely the spots that can instead be used to plan ahead and do useful computation.

Also I would not be sure that even the output tokens are purely "filler". If you look at raw COT, they often have patterns like "but wait!" that are emitted by the model at crucial pivot points. Who's to say that the "you're absolutely right" doesn't serve some other similar purpose of forcing the model into one direction of adjusting its priors.
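
The point about hidden states being carried forward can be illustrated with a toy single-head causal attention layer. This is only a sketch under loud assumptions: the 2-dimensional "hidden states" are made up, the query/key/value projections are the identity, and `causal_attention` is a hypothetical helper, not any real model's implementation. It shows just one thing: the output at a later position changes when the hidden state at an earlier "filler" position changes, even though the later token itself is identical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(hidden):
    """Single-head causal self-attention with identity Q/K/V projections.

    hidden: list of vectors, one per position.
    Position t may attend only to positions 0..t.
    """
    dim = len(hidden[0])
    out = []
    for t, query in enumerate(hidden):
        keys = hidden[: t + 1]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = softmax(scores)
        mixed = [sum(w * key[d] for w, key in zip(weights, keys))
                 for d in range(dim)]
        out.append(mixed)
    return out

# Hypothetical hidden states; position 1 plays the role of the "filler" token.
context_a = [[1.0, 0.0], [0.2, 0.1], [0.5, 0.5]]
# Same states at positions 0 and 2, different state at the filler position.
context_b = [[1.0, 0.0], [0.9, -0.7], [0.5, 0.5]]

out_a = causal_attention(context_a)
out_b = causal_attention(context_b)

# Position 0 cannot see position 1, so it is unchanged; position 2 can, so it differs.
print(out_a[0] == out_b[0], out_a[2] != out_b[2])
```

Whether trained models actually use filler positions this way is a separate empirical question; the sketch only shows that the computational graph permits it.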

by dTal 31 minutes ago

Huh okay, there was a major gap in my mental model. Thanks for helping to clear it up.

by krackers 9 minutes ago

Well, to be fair, the fact that they "can" doesn't mean models necessarily do it. You'd need some interp research to see if they actually do meaningfully "do other computations" when processing low-perplexity tokens. But the fact that the architecture should be capable of it, per the computational graph, means that _not_ doing this is leaving loss on the table, so hopefully the optimizer would force it to learn to do so.

by Chance-Device 5 hours ago

> They aren't steganographically hiding useful computation state in words like "the" and "and".

Do you know that is true? These aren’t just tokens, they’re tokens with specific position encodings preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you’ve read otherwise.

by dTal 5 hours ago

I am quite certain.

The output is "just tokens"; the "position encodings" and "context" are inputs to the LLM function, not outputs. The information that a token can carry is bounded by the entropy of that token. A highly predictable token (given the context) simply can't communicate anything.

Again: if a tiny language model or even a basic markov model would also predict the same token, it's a safe bet it doesn't encode any useful thinking when the big model spits it out.
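
The entropy argument can be made concrete with a short calculation. A minimal sketch, assuming made-up next-token distributions (the probabilities below are illustrative and not taken from any model): surprisal is -log2(p) for the emitted token, and Shannon entropy is the expected surprisal over the whole distribution. It measures information in the output token only and says nothing about hidden states computed at that position.

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical distribution at a grammar-filling position.
predictable = {"is": 0.97, "was": 0.02, "seems": 0.01}
# Hypothetical distribution at a content-bearing position.
uncertain = {"mutex": 0.25, "cache": 0.25, "index": 0.25, "queue": 0.25}

print(round(surprisal_bits(predictable["is"]), 3))  # well under one bit
print(round(entropy_bits(predictable), 3))
print(entropy_bits(uncertain))  # 2.0
```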

by Chance-Device 5 hours ago

I just don’t share your certainty. You may or may not be right, but if there isn’t a result showing this, then I’m not going to assume it.

by 8note 5 hours ago

can you prove this?

train an LLM to leave out the filler words, and see it get the same performance at a lower cost? or do it at token selection time?

by dTal 4 hours ago

Low entropy is low entropy. You can prove it by viewing the logits of the output stream. The LLM itself will tell you how much information is encoded in each token.

Or if you prefer, here's a Galilean thought experiment: gin up a script to get a large language model and a tiny language model to predict the next token in parallel; when they disagree, append the token generated by the large model. Clearly the large model will not care that the "easy" tokens were generated by a different model - how could it even know? Same token, same result. And you will find that the tokens that they agree on are, naturally, the filler words.

To be clear, this observation merely debunks the idea that filler words encode useful information, that they give the LLM "room to think". It doesn't directly imply that an LLM that omits filler words can be just as smart, or that such a thing is trivial to make. It could be that highly predictable words are still important to thought in some way. It could be that they're only important because it's difficult to copy the substance of human thought without also capturing the style. But we can be very sure that what they aren't doing is "storing useful intermediate results".
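
The thought experiment above can be sketched with toy stand-ins. Everything here is an assumption for illustration: a trigram count model plays the "large" model, a bigram count model plays the "tiny" one, the corpus is invented, and `train_ngram`, `predict` and `run_experiment` are hypothetical helpers. It demonstrates the procedure only and proves nothing about real LLMs.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict(table, history, n):
    """Greedy next-token prediction; None if the context was never seen."""
    context = tuple(history[-(n - 1):])
    counts = table.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def run_experiment(big, small, prompt, steps):
    """Always append the big model's token; record where the small model agreed."""
    history, agreed, disagreed = list(prompt), [], []
    for _ in range(steps):
        big_tok = predict(big, history, 3)
        small_tok = predict(small, history, 2)
        if big_tok is None:
            break
        (agreed if big_tok == small_tok else disagreed).append(big_tok)
        history.append(big_tok)
    return agreed, disagreed

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat ate the fish .").split()
big = train_ngram(corpus, 3)    # stand-in for the large model
small = train_ngram(corpus, 2)  # stand-in for the tiny model

agreed, disagreed = run_experiment(big, small, ["the", "dog"], steps=8)
print("agreed:", agreed)
print("disagreed:", disagreed)
```

Because the large model's token is appended at every step, the generated sequence is identical to what the large model would have produced alone; the only thing the run measures is how often the small model would have sufficed.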

by vova_hn2 10 hours ago

Yeah, I don't think that "I'd be happy to help you with that" or "Sure, let me take a look at that for you" carries much useful signal that can be used for the next tokens.

by lanyard-textile 7 hours ago

You'd be surprised -- This could match on the model's training to proceed using a tool, for example.

by jerf 9 hours ago

There is a study that shows that what the model is doing behind the scenes in those cases is a lot more than just outputting those tokens.

For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.

Clearly there is an optimum for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.

by DonHopkins 9 hours ago

High dimensional vectors are thought (insofar as you can define what that even means). Tokens are one dimensional input that navigates the thought, and output that renders the thought. The "thinking" takes place in the high dimension space, not the one dimensional stream of tokens.

by gchamonlive 8 hours ago

But aren't the one-dimensional tokens a reflection of the high-dimensional space? What you see is "sure let's take a look at that" but behind the curtains it's actually an indication that it's searching a very specific latent space which might be radically different if those tokens didn't exist. Or not. In any case, you can't just make that claim and isolate those two processes. They might be totally unrelated but they also might be tightly interconnected.

by sheiyei 8 hours ago

I assume in practice, filler words do nothing of value. When words add or mean nothing (their weights are basically 0 in relation to the subject), I don't see why they'd affect what the model outputs (except cause more filler words)?

by gchamonlive 8 hours ago

Politeness has an impact (https://arxiv.org/abs/2402.14531), so I wouldn't be too quick to make any kind of claim about a technology when we don't know exactly how it works.

by wzdd 9 hours ago

They carry information in regular human communication, so I'm genuinely curious why you'd think they would not when an LLM outputs them as part of the process of responding to a message.

by andy99 8 hours ago

I’ve heard this, I don’t automatically believe it nor do I understand why it would need to be true, I’m still caught on the old fashioned idea that the only “thinking” for autoregressive models happens during training.

But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like, it’s clearly not linear, so if you cut out 75% of tokens what do you expect to lose?

I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.

by therealdrag0 4 hours ago

We’re years into the industry leaning into “chain of thought” and then “thinking models” that are based on this premise, forcing more token usage to avoid premature conclusions and notice contradictions (I sometimes see this leak into final output). You may remember in the early days users themselves would have to say “think deeply” or after a response “now check your work” and it would find its own “one shot” mistakes often.

So it must be studied and at least be proven effective in practice to be so universally used now.

Someone else posted a few articles like this in the thread above but there’s probably more and better ones if you search. https://news.ycombinator.com/item?id=47647907

by conception 6 hours ago

I have seen a paper, though I can’t find it right now, showing that phrasing your prompt in expert language produces better results than layman language. The idea being that the answers that are actually correct will probably be closer to where experts are speaking about the topic, so the training data will associate those two things more closely than it would laymen talking about stuff and getting it wrong.

by kubb 10 hours ago

This is condescending and wrong at the same time (best combo).

LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.

by prodigycorp 9 hours ago

Are you sure about that? Chain of thought does not need to be semantically useful to improve LLM performance. https://arxiv.org/abs/2404.15758

by kubb 3 hours ago

If you're misusing LLMs to solve TC^0 problems, which is what the paper is about, then... you also don't need the slop lavine. You can just inject a bunch of filler tokens yourself.

by davidguetta 9 hours ago

still doesn't mean all tokens are useful. it's the point of benchmarks

by prodigycorp 9 hours ago

Care to share the benchmarks backing the claims in this repo?

by NiloCK 10 hours ago

I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.

Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, but that it is both possible, and that it's more or less likely on one kind of problem than another, problem categories to be determined.

A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.

More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
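
The "scales ~linearly with length" intuition can be checked with a small calculation. A minimal sketch under a strong simplifying assumption that is not in the comment: each token is independently "low-probability" with a fixed chance `p`. The hypothetical helper `prob_run_of_two` then gives the chance that at least two such tokens land next to each other somewhere in an `n`-token response.

```python
def prob_run_of_two(n, p):
    """P(at least one pair of adjacent low-probability tokens among n tokens).

    Assumes each token is independently low-probability with chance p.
    """
    # ok_high: no run so far and the last token was not low-probability
    # ok_low:  no run so far and the last token was low-probability
    ok_high, ok_low = 1.0, 0.0
    for _ in range(n):
        ok_high, ok_low = (ok_high + ok_low) * (1 - p), ok_high * p
    return 1.0 - (ok_high + ok_low)

p = 0.01  # hypothetical per-token chance of an unlucky sample
for n in (100, 200, 400):
    print(n, prob_run_of_two(n, p))
```

For small `p` the result is close to `(n - 1) * p**2`, so doubling the response length roughly doubles the risk, which is consistent with the roughly linear scaling described above.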

by samus 10 hours ago

The solution to that is turning off thinking mode or reducing thinking budget.

by avaer 10 hours ago

That was my first thought too -- instead of talk like a caveman you could turn off reasoning, with probably better results.

Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.

So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.

by DrewADesign 10 hours ago

> much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.

What do you mean by that? It’s literally text prediction, isn’t it?

by K0balt 8 hours ago

It is text prediction. But to predict text, other things follow that need to be calculated. If you can step back for just a minute, I can provide a very simple but adjacent idea that might help to intuit the complexity of “text prediction”.

I have a list of numbers, 0 to 9, and the + and = operators. I will train my model on this dataset, except the model won’t get the list; it will get a bunch of addition problems. A lot. But not every addition problem possible inside that space will be represented, not by a long shot, and neither will every number. Still, the model will be able to solve any math problem you can form with those symbols.

It’s just predicting symbols, but to do so it had to internalize the concepts.

by qsera 5 hours ago

>internalize the concepts.

This gives the impression that it is doing something more than pattern matching. I think this kind of communication where some human attribute is used to name some concept in the LLM domain is causing a lot of damage, and ends up inadvertently blowing up the hype for the AI marketing...

by cyanydeez 10 hours ago

There was a paper recently that demonstrated that you can input different human languages and the middle layers of the model end up operating on the same probabilistic vectors. It's just the encoding/decoding layers that appear to do the language management.

So the conclusion was that these middle layers have their own language, and the model is converting the text into this language and then decoding it. It explains why the models sometimes switch to Chinese when they have a lot of Chinese-language inputs, etc.

by DrewADesign 10 hours ago

Ok — that sounds more like a theory rather than an open-and-shut causal explanation, but I’ll read the paper.

by trenchgun 7 hours ago

You’re a literature cycle behind. ‘Middle-layer shared representations exist’ is the observed phenomenon; ‘why exactly they form’ is the theory.

You are also confusing ‘mechanistic explanation still incomplete’ with ‘empirical phenomenon unestablished.’ Those are not the same thing.

PS. Em dash? So you are some LLM bot trying to bait mine HN for reasoning traces? :D

by DrewADesign 4 hours ago

Oh, Jesus Christ. I learned to write at a college with a strict style guide that taught us how to use different types of punctuation to juxtapose two ideas in one sentence. In fact, they did/do a bunch of LLM work so if anyone ever used student data to train models, I’m probably part of the reason they do that.

You sound like you’re trying to sound impressive. Like I said, I’ll read the paper.

by cyanydeez 4 hours ago

Congrats on reading.

by DrewADesign 54 minutes ago

Sick burn

by skydhash 8 hours ago

Pretty obvious when you consider that neural networks operate with numbers and very complex formulas (built by combining several simple formulas with various weights). You can map a lot of things to numbers (words, colors, music notes,…) but that does not mean the NN is going to provide useful results.

by DrewADesign 4 hours ago

Everything is obvious if you ignore enough of the details/problem space. I’ll read the paper rather than rely on my own thought experiments and assumptions.

by pennaMan 10 hours ago

>It’s literally text prediction, isn’t it?

you are discovering that the favorite luddite argument is bullshit

by ericjmorey 9 hours ago

I don't consider these researchers luddites.

https://machinelearning.apple.com/research/illusion-of-think...

https://arxiv.org/abs/2508.01191

by DrewADesign 10 hours ago

Feel free to elucidate if you want to add anything to this thread other than vibes.

by electroglyph 10 hours ago

After you go from millions of params to billions+, models start to get weird (depending on training). Just look at any number of interpretability research papers. Anthropic has some good ones.

by HumanOstrich 9 hours ago

> things start to get weird

> just look at research papers

You didn't add anything other than vibes either.

by Barbing 6 hours ago

Interesting, what kind of weird?

by DrewADesign 9 hours ago

Getting weird doesn’t mean calling it text prediction is actually ‘bullshit’? Text prediction isn’t pejorative…

by vova_hn2 10 hours ago

> instead of talk like a caveman you could turn off reasoning, with probably better results

This is not how the feature called "reasoning" works in current models.

"reasoning" simply lets the model output and then consume some "thinking" tokens before generating the actual output.

All the "fluff" tokens in the output have absolutely nothing to do with "reasoning".

by throw83849494 10 hours ago

You obviously do not speak other languages. Other cultures have different constraints and different grammar.

For example, thinking in modern US English generates many extra thoughts just to keep speech correct for the cultural context (there is only one correct way to say People Of Color, and it changes every year; any typo makes it horribly wrong).

Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.

It is well proven that thinking in Chinese needs far fewer tokens!

With this caveman mod you strip out most of the cultural complexities of the Anglosphere, making it easier for foreigners and far simpler to digest.

by suddenlybananas 10 hours ago

>Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.

This is simply not true.

by throw83849494 9 hours ago

Well, just take the various English dialects you probably know; there are vast differences. Some strange languages do not even have numbers or recursion.

It is very arrogant to assume that no other language can be more advanced than English.

by mylifeandtimes 9 hours ago

Really? Because if one accepts that computer languages are languages, then it seems that we could identify one or two that are highly specialized in logical conditions etc. Prolog springs to mind.

by malnourish 9 hours ago

Yes, really. The concept GP is alluding to is called the Sapir-Whorf hypothesis, which is largely non-scientific pop linguistics drivel. Elements of a much weaker version have some scientific merit.

Programming languages are not languages in the human brain nor the culture sense.

by skydhash 8 hours ago

We have already proven that all the computing mechanisms from which those languages derive their semantics are equivalent to the Turing machine. So C and Prolog differ only in notation, not in what they can compute.

by strogonoff 7 hours ago

A fundamental (but sadly common) error behind “tokens are units of thinking” is anthropomorphising the model as a thinking being. That’s a pretty wild claim that requires a lot of proof, and possibly solving the hard problem, before it can be taken seriously.

There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.

Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?

It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.

Willing to be corrected by someone more familiar with NN architecture, of course.

[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.

by ForceBru 6 hours ago

IMO "thinking" here means "computation", like running matrix multiplications. Another view could be: "thinking" means "producing tokens". This doesn't require any proof because it's literally what the models do.

As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
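
The "more tokens = more computation" half of that chain is simple arithmetic. A rough sketch, assuming the commonly cited rule of thumb of about 2 FLOPs per parameter per generated token for a dense transformer forward pass (attention cost is ignored), with a hypothetical 7B-parameter model and made-up response lengths. It says nothing about whether the extra computation improves the answer.

```python
def decode_flops(num_params, num_tokens):
    """Rough forward-pass cost: ~2 FLOPs per parameter per generated token."""
    return 2 * num_params * num_tokens

params = 7_000_000_000                 # hypothetical dense model size
verbose = decode_flops(params, 400)    # made-up verbose response length
terse = decode_flops(params, 100)      # the same answer with 75% fewer tokens

print(terse / verbose)  # 0.25
```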

by HarHarVeryFunny 7 hours ago

That's going to depend on what model you're using with Claude Code. All of the more recent Anthropic models (4.5 and 4.6) support thinking, so the number of tokens generated ("units of thought") isn't directly tied to the verbosity of input and non-thought output.

However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.

It's a bit like asking an LLM to predict the next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the Elo rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.

Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.

by pxc 6 hours ago

If this is true, shouldn't LLMs perform way worse when working in Chinese than in English? Seems like an easy thing to study since there are so many Chinese LLMs that can work in both Chinese and English.

Do LLMs generally perform better in verbose languages than they do in concise ones?

by reedlaw 4 hours ago

Are you saying Chinese is more concise than English? Chinese poetry is concise, but that can be true in any language. For LLMs, it depends on the tokenizer. Chinese models are of course more Chinese-friendly and so would encode the same sentence with fewer tokens than Western models.

by pxc 2 hours ago

> Are you saying Chinese is more concise than English?

Yeah, definitely. It lacks case and verb conjugations, plus whole classes of filler words, and words themselves are on average substantially shorter. If you listen to or read a hyper-literal transliteration of Chinese speech into English (you can find fun videos of this on Chinese social media), it even resembles "caveman speech" for those reasons.

If you look at translated texts and compare the English versions to the Chinese ones, the Chinese versions are substantially shorter. Same if you compare localization strings in your favorite open-source project.

It's also part of why Chinese apps are so information-dense, and why localizing to other languages often requires reorganizing the layout itself— languages like English just aren't as information-dense, pixel for pixel.

The difference is especially profound for vernacular Chinese, which is why Chinese people often note that text which "has a machine translation flavor" is over-specified and gratuitously prolix.

Maybe some of this washes out in LLMs due to tokenization differences. But Chinese texts are typically shorter than English texts and it extends to prose as well as poetry.

But yeah this is standard stuff: Chinese is more concise and more contextual/ambiguous. More semantic work is allocated in interpretation than with English, less is allocated in the writing/speaking.

Do you speak Chinese and experience the differences between Chinese and English differently? I'm a native English speaker and only a beginner in Chinese but I've formed these views in discussion with Chinese people who know some English as well.

by reedlaw 2 hours ago

Chinese omits articles, verbs aren't conjugated, and individual characters carry more meaning than English letters, but other than those differences I don't have the impression that Chinese communication is inherently more concise. Some forms of official speech are wordy. Writing is denser, but the amount of information conveyed through speech is about the same. There are jokes about ambiguous words or phrases in both Chinese and English. So I was surprised at your take, but no objection to your points above. Ancient Chinese, on the other hand, is extremely concise, but so are other ancient languages like Hebrew, although in a different way. So it seems that ancient languages are compressed but challenging and modern languages have unpacked the compression for ease of understanding.

by pxc 25 minutes ago

That's a really interesting point about Ancient Chinese and other ancient scripts. I'd love to learn more about that.

I'm also more curious about tokenizers for LLMs than I've ever been before, both for Chinese and English. I feel like to understand I'll need to look at some concrete examples, since sometimes tokenization can be per word or per character or sometimes chunks that are in between.
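
As a first concrete example, here is how raw text size compares before any tokenizer is involved. This is only a sketch: the sentence pair is an assumed, approximate translation, `size_stats` is a hypothetical helper, and no real tokenizer is used. Byte-level BPE tokenizers start from UTF-8 bytes and merge frequent sequences, so the actual token count depends on the merges a particular tokenizer has learned.

```python
def size_stats(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))

english = "I would like to drink tea"
chinese = "我想喝茶"  # approximate translation, chosen for illustration

for label, text in (("English", english), ("Chinese", chinese)):
    chars, utf8_bytes = size_stats(text)
    print(label, "chars:", chars, "bytes:", utf8_bytes)
```

The Chinese sentence is far shorter in characters, but each character takes three bytes in UTF-8, so the gap narrows at the byte level. How much of the remaining difference survives tokenization is the part that needs a real tokenizer to measure.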

by baq 11 hours ago

Do you know of evals with default Claude vs caveman Claude vs politician Claude solving the same tasks? Hypothesis is plausible, but I wouldn’t take it for granted

by marginalia_nu 8 hours ago

I wonder if a language like Latin would be useful.

It's a significantly more succinct semantic encoding than English while being able to express all the same concepts, since it encodes a lot of glue words into the grammar of the language, and conventionally lets you drop many pronouns.

e.g.

"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).

by mike_hearn 5 hours ago

I think speculative decoding eliminates a lot of the savings people imagine they're getting from making LLMs use strange languages.
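
For readers unfamiliar with the technique, here is a greedy speculative decoding sketch. It is simplified and rests on stated assumptions: `target_next` and `draft_next` are hypothetical toy callables rather than real models, verification is written as a loop where a real system would score all drafted positions in one batched forward pass of the target model, and the bonus token that real implementations emit after a fully accepted draft is omitted.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding sketch.

    Returns (tokens, target_passes). The output always matches what the
    target model would produce with plain greedy decoding.
    """
    tokens = list(prompt)
    passes = 0
    produced = 0
    while produced < max_new_tokens:
        # The cheap draft model proposes k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # One (conceptually batched) target pass verifies the proposal.
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)
            if expected != tok:
                break  # first disagreement: keep the target's token, stop
        accepted = accepted[: max_new_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens, passes

SENTENCE = "the lock is held by the writer thread".split()
FILLER = {"the", "is", "by"}

def target_next(tokens):
    """Toy target model: recites SENTENCE in a loop."""
    return SENTENCE[len(tokens) % len(SENTENCE)]

def draft_next(tokens):
    """Toy draft model: gets filler words right and guesses 'thing' otherwise."""
    tok = target_next(tokens)
    return tok if tok in FILLER else "thing"

tokens, passes = speculative_decode(target_next, draft_next, [], max_new_tokens=8)
print(" ".join(tokens))
print("target passes:", passes, "vs", len(tokens), "without speculation")
```

In this toy run the predictable words ride along with the content words, so they cost no extra target passes, which is the sense in which cutting filler may save less than the raw token count suggests.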

by dmboyd 7 hours ago

Words <> tokens

by zozbot234 9 hours ago

Grug says you quite right, token unit thinking, but empty words not real thinking and should avoid. Instead must think problem step by step with good impactful words.

by raincole 10 hours ago

When it comes to LLM you really cannot draw conclusions from first principles like this. Yes, it sounds reasonable. And things in reality aren't always reasonable.

Benchmark or nothing.

by samus 10 hours ago

There have been papers about introducing thinking tokens in intermediary layers that get stripped from the output.

by hackerInnen 8 hours ago

You are absolutely right! That is exactly the reason why more lines of code always produce a better program. Straight on, m8!

by ZoomZoomZoom 5 hours ago

This might not be so far from the truth, if you count total LOC written and rewritten during the development cycle, not just the final number.

Not everybody is Dijkstra.

by andai 11 hours ago

I remember a while back they found that replacing reasoning tokens with placeholders ("....") also boosted results on benchies.

But does talk like caveman make number go down? Less token = less think?

I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?

by afro88 10 hours ago

IIUC this doesn't make the LLM think in caveman (thinking tokens). It just makes the final output show in caveman.

by Demiurg082 6 hours ago

CoT tokens are usually controlled via 'extended thinking' or 'adaptive thinking'. CoT tokens are usually not affected by the system prompt. There is an effort parameter, though, which is stated to affect accuracy and overall token consumption.

https://platform.claude.com/docs/en/build-with-claude/extend...

by bitexploder 6 hours ago

This helps, but the original prompt is still there. The system prompt is still influencing these thinking blocks. They just don’t end up clogging up your context. The system prompt sits at the very top of the context hierarchy. Even with isolated "thinking" blocks, the reasoning tokens are still autoregressively conditioned on the system instructions. If the system prompt forces "caveman speak" the model's attention mechanisms are immediately biased toward simpler, less coherent latent spaces. You are handicapping the vocabulary and syntax it uses inside its own thinking process, which directly throttles its ability to execute high-level logic.

Nothing on that page indicates otherwise.

by xgulfie8 hours ago|

prev|

[-]

Ah, so obviously making the LLM repeat itself three times in every response will make it smarter

by agumonkey10 hours ago|

prev|

[-]

How do we know if a token sits at an abstract level or just the textual level?

by PufPufPuf9 hours ago|

prev|

[-]

You mention thinking tokens as a side note, but their existence invalidates your whole point. Virtually all modern LLMs use thinking tokens.

by cyanydeez10 hours ago|

prev|

[-]

It's not "units of thinking", it's "units of reference"; as long as what it produces references the necessary probabilistic algorithms, it'll do just fine.

by otabdeveloper49 hours ago|

prev|

[-]

LLMs don't think at all.

Forcing it to be concise doesn't work because it wasn't trained on token strings that short.

by HumanOstrich9 hours ago|

parent|

[-]

> Forcing it to be concise doesn't work because it wasn't trained on token strings that short.

This is a 2023-era comment and is incorrect.

by Barbing6 hours ago|

parent|

[-]

Anything I can read that would settle the debate?

by otabdeveloper48 hours ago|

parent|

prev|

[-]

LLM architectures have not changed at all since 2023.

> but mmuh latest SOTA from CloudCorp (c)!

You don't know how these things work and all you have to go on is marketing copy.

by HumanOstrich6 hours ago|

parent|

[-]

Yea you don't know anything about LLM architectures. They often change with each model release.

You also aren't aware that there's more to it than "LLM architecture". And you're rather confident despite your lack of knowledge.

You're like the old LLMs before ChatGPT was released that were kinda neat, but usually wrong and overconfident about it.

by rafram8 hours ago|

parent|

prev|

[-]

They’re able to solve complex, unstructured problems independently. They can express themselves in every major human language fluently. Sure, they don’t actually have a brain like we do, but they emulate it pretty well. What’s your definition of thinking?

by otabdeveloper46 hours ago|

parent|

[-]

When OP wrote about LLMs "thinking" he implied that they have an internal conceptual self-reflecting state. Which they don't, they *are* merely next token predicting statistical machines.

by rafram5 hours ago|

parent|

[-]

This was true in 2023.

by fkgmeqnb5 hours ago|

parent|

[-]

And it still is today.

by kogold9 hours ago|

prev|

[-]

[flagged]

by dang2 hours ago|

parent|

[-]

"Don't be snarky."

https://news.ycombinator.com/newsguidelines.html

by Chance-Device8 hours ago|

parent|

prev|

[-]

Let’s see, I think these pretty much map out a little chronology of the research:

https://arxiv.org/abs/2112.00114

https://arxiv.org/abs/2406.06467

https://arxiv.org/abs/2404.15758

https://arxiv.org/abs/2512.12777

First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.

by bsza7 hours ago|

parent|

[-]

I don’t see the relevance. The discussion is over whether boilerplate text that occurs intermittently in the output, purely for the sake of linguistic correctness or sounding professional, is of any benefit. Chain of thought doesn’t look like that to begin with; it’s a contiguous block of text.

by Chance-Device7 hours ago|

parent|

[-]

To boil it down: chain of thought isn’t really chain of thought, it’s just more token generation output to the context. The tokens are participating in computations in subsequent forward passes that are doing things we don’t see or even understand. More LLM generated context matters.
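A toy sketch of that point, not a real model: `toy_next_token` is a made-up stand-in for a forward pass (just a hash of the whole context), and `VOCAB` is invented for illustration. The only thing it is meant to show is the loop shape, where every generated token is appended to the context and becomes an input to all later steps.

```python
import hashlib

# Hypothetical tiny vocabulary, purely for illustration.
VOCAB = ["so", "then", "because", "x", "y", "therefore", "done"]


def toy_next_token(context: list[str]) -> str:
    """Stand-in for one forward pass: a deterministic function of the WHOLE context."""
    digest = hashlib.sha256(" ".join(context).encode("utf-8")).digest()
    return VOCAB[digest[0] % len(VOCAB)]


def generate(prompt: list[str], steps: int) -> list[str]:
    """Autoregressive loop: each new token is appended and conditions every later step."""
    context = list(prompt)
    for _ in range(steps):
        context.append(toy_next_token(context))
    return context[len(prompt):]


if __name__ == "__main__":
    # Changing the prompt changes the input to every step, so the continuation
    # will usually differ (the toy hash can collide, so this is not guaranteed).
    print(generate(["system:", "be", "verbose", "user:", "why?"], 6))
    print(generate(["system:", "talk", "caveman", "user:", "why?"], 6))
```

In a real LLM the function is a learned network rather than a hash, but the dependency structure is the same: whatever was generated earlier, meaningful or not, is part of what later passes compute over.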

by bitexploder7 hours ago|

parent|

prev|

[-]

That is not how CoT works. It is all in context. All influenced by context. This is a common and significant misunderstanding of autoregressive models and I see it on HN a lot.

by j16sdiz7 hours ago|

parent|

prev|

[-]

"I don't see the relevance" -- while casually dismissing years of research without even trying to read those papers.

by bitexploder7 hours ago|

parent|

prev|

[-]

That "unproven claim" is actually a well-established concept called Chain of Thought (CoT). LLMs literally use intermediate tokens to "think" through problems step by step. They have to generate tokens to talk to themselves, debug, and plan. Forcing them to skip that process by cutting tokens, like making them talk in caveman speak, directly restricts their ability to reason.

by ShowalkKama9 hours ago|

parent|

prev|

[-]

The fact that more tokens = more smart should be expected, given CoT / thinking / other techniques that increase model accuracy by using more tokens.

Did you test that "caveman mode" has similar performance to the "normal" model?
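For what such a test could look like, here is a minimal sketch. It assumes you have already run the same task set in both modes and recorded, per task, whether the answer was correct and how many output tokens were used. `Result`, `summarize`, and `compare` are hypothetical names for this example, not part of the project being discussed.

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool        # did the answer pass the task's check?
    output_tokens: int   # output tokens used for this task


def summarize(results: list[Result]) -> tuple[float, float]:
    """Return (accuracy, mean output tokens) for one mode."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    mean_tokens = sum(r.output_tokens for r in results) / n
    return accuracy, mean_tokens


def compare(normal: list[Result], caveman: list[Result]) -> dict[str, float]:
    """Accuracy change and fractional token savings of caveman mode vs. normal."""
    acc_n, tok_n = summarize(normal)
    acc_c, tok_c = summarize(caveman)
    return {
        "accuracy_delta": acc_c - acc_n,      # negative means caveman is worse
        "token_savings": 1 - tok_c / tok_n,   # 0.75 means 75% fewer tokens
    }
```

A "75% fewer tokens" claim only says something about `token_savings`; the "keeping full technical accuracy" part needs `accuracy_delta` measured on the same tasks, ideally over enough tasks that the difference is not noise.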

by Garlef9 hours ago|

parent|

[-]

Yes but: If the amount is fixed, then the density matters.

A lot of communication is just mentioning the concepts.

by bitexploder6 hours ago|

parent|

prev|

[-]

That is part of it. They are also trained to think in very well mapped areas of their model. All the RLHF, etc. tuned on their CoT and user feedback of responses.

by 8 hours ago|

parent|

prev|

[-]

deleted

by ano-ther7 hours ago|

parent|

prev|

[-]

Looking at the skill.md, wouldn’t this actually increase token use, since the model now needs to reformat its output?

Funny idea though. And I’d like to see a more matter-of-fact output from Claude.

by collingreen2 hours ago|

parent|

prev|

[-]

I assume you're a human but wow this is the type of forum bot I could really get behind.

Take it a step further and do kind of like that xkcd where you try to post and it rewrites it like this and if you want the original version you have to write a justification that gets posted too.

Chef's kiss

by mynegation9 hours ago|

parent|

prev|

[-]

No, let me rephrase it for you. “tokens used for think. Short makes model dumb”

by freehorse8 hours ago|

parent|

[-]

Talk a lot not same as smart

by taneq7 hours ago|

parent|

[-]

Think before talk better though

by freehorse6 hours ago|

parent|

[-]

Think makes smart. But think right words makes smarter, not think more words. Smart is elucidate structure and relationships with right words.

by ben_w3 hours ago|

parent|

[-]

think make smart, llm approximate "think" with context, llm not smart ever but sometimes less dumb with more word

by huflungdung7 hours ago|

parent|

prev|

[-]

[dead]

by estearum8 hours ago|

parent|

prev|

[-]

Can't you know that tokens are units of thinking just by... like... thinking about how models work?

by gchamonlive8 hours ago|

parent|

[-]

Can't you just know that the earth is the center of the world by... like... just looking at how the world works?

by estearum7 hours ago|

parent|

[-]

Actually you'd trivially disprove that claim if you're starting from mechanistic knowledge of how orbits work, like how we have mechanistic knowledge of how LLMs work.

by gchamonlive7 hours ago|

parent|

[-]

You have empirical observations, like replicating a fixed set of inner layers to make it think longer, or that you seem to have encode and decode layers. But exactly why those layers are the way they are, how they come together for emergent behaviour... Do we have mechanistic knowledge of that?

by ben_w3 hours ago|

parent|

[-]

I think we've *only* got the mechanism, not the implications.

Compare with fluid dynamics; it's not hard to write down the Navier–Stokes equations, but there's a million dollars available to the first person who can prove or give a counter-example of the following statement:

  In three space dimensions and time, given an initial velocity field, there exists a vector velocity and a scalar pressure field, which are both smooth and globally defined, that solve the Navier–Stokes equations.

- https://en.wikipedia.org/wiki/Navier–Stokes_existence_and_sm...

by xpe6 hours ago|

parent|

prev|

[-]

Though the above exchange felt a tiny bit snarky, I think the conversation did get more interesting as it went on. I genuinely think both people could probably gain by talking more -- or at least figuring out a way to move past the surface-level differences. Yes, humans designed LLMs. But this doesn't mean we understand their implications even at this (relatively simple) level.

by xpe8 hours ago|

parent|

prev|

[-]

> Can't you know that tokens are units of thinking just by... like... thinking about how models work?

Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?

by estearum7 hours ago|

parent|

[-]

Right, there's probably something more subtle like "semantic density within tokens is how models think"

So it's probably true that the "Great question!---" type preambles are not helpful, but that there's definitely a lower bound on exactly how primitive of a caveman language we're pushing toward.

by taneq7 hours ago|

prev|

[-]

More concise is dumber. Got it.

by Rexxar10 hours ago|

prev|

[-]

  > Someone didn't get the memo that for LLMs, tokens are units of thinking.

Where do you get this memo? Seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (i.e. things that contribute significantly to the final sentence meaning).

by staminade10 hours ago|

parent|

[-]

That’s why you need filler words that contribute little to the sentence meaning but give it a chance to compute/think. This is part of why humans do the same when speaking.

by dTal5 hours ago|

parent|

[-]

The LLM has no accessible state beyond its own output tokens; each pass generates a single token and does not otherwise communicate with subsequent passes. Therefore all information calculated in a pass must be encoded into the entropy of the output token. If the only output of a thinking pass is a dumb filler word with hardly any entropy, then all the thinking for that filler word is forgotten and cannot be reconstructed.
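To put a rough number on "hardly any entropy": under this comment's assumption that only the emitted token carries information forward, the Shannon entropy of the next-token distribution bounds how much a single step can pass on. The sketch below uses made-up probabilities, not measurements from any model.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative, invented distributions over four candidate next tokens.
near_certain_filler = [0.97, 0.01, 0.01, 0.01]  # e.g. an almost forced filler word
open_choice = [0.25, 0.25, 0.25, 0.25]          # four equally likely continuations

if __name__ == "__main__":
    print(f"filler: {entropy_bits(near_certain_filler):.3f} bits")
    print(f"open:   {entropy_bits(open_choice):.3f} bits")
```

The near-forced filler carries a small fraction of a bit, while the open choice carries two bits; a token that was almost certain to appear tells later passes very little about what was computed to produce it.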

by jaccola10 hours ago|

parent|

prev|

[-]

Do you have any evidence at all of this? I know how LLMs are trained and this makes no sense to me. Otherwise you'd just put filler words in every input

e.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated itself...

by muzani9 hours ago|

parent|

[-]

It's why it starts with "You're absolutely right!" It's not to flatter the user. It's a cheap way to guide the response in a space where it's utilizing the correction.

by mike_hearn5 hours ago|

parent|

prev|

[-]

People have researched pause tokens for this exact reason.

by staminade9 hours ago|

parent|

prev|

[-]

What do you think chain of thought reasoning is doing exactly?

by lijok10 hours ago|

parent|

prev|

[-]

You’re conflating training and inference