undefined

points

[-]

There is currently very little evidence that morphological tokenizers help model performance [1]. For languages like German (where words get glued together) there is a bit more evidence (eg a paper I worked on [2]), but overall I start to suspect the bitter lesson is also true for tokenization.

[1] https://arxiv.org/pdf/2507.06378

[2] https://pieter.ai/bpe-knockout/

by sigmoid108 hours ago|

parent|

[-]

I never understood why people want this in the first place. Sure, making this step more human explainable would be nice and possibly even fix some very particular problems for particular languages, but it directly goes against the primary objective of a tokenizer: Optimizing sequence length vs. vocabulary size. This is a pretty clear and hard optimization target and the best you can do is make sure that your tokenizer training set more closely mimics your training and ultimately your inference data. Putting english or german grammar in there by force will only degrade every other language in the tokenzier, while we already know that limiting additional languages will hurt overall model performance. And the belief that you can encode a dataset of trillions of tokens into a more efficient vocabulary than a machine is kind of weird tbh. People have also accepted since the early convnet days that the best encoding representation for images in machine learning is not a human understandable one. Same goes for audio. So why should text be any different? If you really think so, you might also wanna have a go at feature engineering images. And it's not like people haven't tried that. But they all eventually learned their lesson.

by wongarsu6 hours ago|

parent|

[-]

We usually build the tokenizer by optimizing for one goal (space-efficient encoding of text), then use it in a model that is trained for an entirely different goal (producing good text, "reasoning", "coding", etc). It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the llm.

That's what all these attempts boil down to. They don't presume to be able to find a more space-efficient encoding by hand, they assume that the optimization goal for the tokenizer was wrong and they can do better by adding some extra rules. And this isn't entirely without precendent, most tokenizers have a couple of "forced" tokens that were not organically discovered. Moving around how digits are grouped in the tokenizer is another point where wins have been shown.

This is where projects like nanochat are really valuable for quickly and (relatively) cheaply trying out various tweaks

by sigmoid106 hours ago|

parent|

[-]

>It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the llm.

Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently with fewer tokens by increasing the vocabulary, which obviously directly benefits the attention mechanism if you look at the dimensions of involved matrices. The biggest issues so far have stemmed from variances between tokenizer and LLM training sets [1] and the fact that people primarily work with character based text and not word-part based text (even though that gets muddy when you look at what is actually happening in the brain) when doing anything in writing.

[1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

by amelius3 hours ago|

parent|

prev|

[-]

If you want to make it more human-explainable, then ditch the entire tokenizer and just feed the models raw characters. Because now there is nothing to explain.

by rrr_oh_man7 hours ago|

parent|

prev|

[-]

I feel it's a case of "This random word generator can't possibly be smarter than I?!"

by dannyw9 hours ago|

prev|

[-]

LLMs are explicitly designed to handle, and also possibly 'learn' from different tokens encoding similar information. I found this video from 3blue1brown very informative: https://www.youtube.com/watch?v=wjZofJX0v4M

Also, think about how a LLM would handle different languages.

by friendzis9 hours ago|

prev|

[-]

This is such a superficial, English-centric take, but it might as well be true. It seems to me that in non-english languages the models, especially chatgpt, have suffered in the declension department and output words in cases that do not fit the context.

I have just ran an experiment: I have taken a word and asked models (chatgpt, gemini and claude) to explode it into parts. The caveat is that it could either be root + suffix + ending or root + ending. None of them realized this duality and have taken one possible interpretation.

Any such approach to tokenizing assumes context free (-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without a broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.

by orbital-decay6 hours ago|

parent|

[-]

Current models understand different tokenization variants perfectly, e.g. leading space vs no leading space vs one character per token. It doesn't even affect evals and behchmarks. They're also good at languages that have very flexible word formation (e.g. Slavic) and can easily invent pretty natural non-existent words without being restricted by tokenization. This ability took a bit of a hit with recent RL and code generation optimizations, but this is not related to tokenization.

>None of them realized this duality and have taken one possible interpretation.

I suspect this happens due to mode collapse and has nothing to do with the tokenization. Try this with a base model.

by fooker9 hours ago|

prev|

[-]

This is how language models have worked since their inception, and has been steadily improved since about 2018.

See embedding models.

> they removed the tokenizer altogether

This is an active research topic, no real solution in sight yet.

by frb6 hours ago|

prev|

[-]

I was looking into morpheme tokenization approach, but went even more radical with building a semantic primitive tokenizer [1], i.e. kill, killed, killer would all share the same semantic connection and tokens, e.g. [KILL], [KILL, BEFORE], [KILL, SOMEONE].

It’s based on semantic primitives (Wierzbicka NSM) and emoji (the fun idea that got me interested in this in the first place).

So far I’ve tested 6 iterations and it trains and responds well with a 10k vocab, but the grammar came out rougher. Working on 8th iteration, mainly to improve the grammar and language. Turns out the smaller vocab couldn’t be maintained and all improvements get us back in the ballpark of the 32k vocab size. Further testing is still outstanding for this week.

[1] https://github.com/frane/primoji

by nl8 hours ago|

prev|

[-]

This is almost certainly wrong.

Case sensitive language models have been a thing since way before neural language models. I was using them with boosted tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there of course - I based that on PG's "A Plan for Spam".

See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...

The bitter lesson says that you are much better off just adding more data and learning the tokenizer and it will be better.

It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it is *the learned Mythos tokenizer?%), and it seems likely that the Mythos pre-training run is the most data ever trained on.

Putting an inductive bias in your tokenizer seems just a terrible idea.

by yorwba5 hours ago|

parent|

[-]

Anthropic was already special-casing case-folding in their tokenizers before this recent change: https://transformer-circuits.pub/2025/attribution-graphs/met... "The tokenizer the model was trained with uses a special “Caps Lock” token" (⇪). Their visualizations for Claude 3.5 Haiku also show the Title Case token (↑).

This is similar to what the TokenMonster tokenizer does: https://github.com/alasdairforsythe/tokenmonster

by kouteiheika7 hours ago|

parent|

prev|

[-]

> This is almost certainly wrong.

So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?

> Putting an inductive bias in your tokenizer seems just a terrible idea.

You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree trying to bake this manually in a tokenizer is most likely not a good idea, but I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.

[1] Example from Qwen3.6 tokenizer:

    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      }
    ]
  },

by nl7 hours ago|

parent|

[-]

> So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?

Just modeling whitespace as its own token would seem to explain the increase.

> Qwen3.6 tokenizer: "pretokenizer"

That's the pre-tokenizer, not the tokenizer. That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.

> I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.

The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. See the BPE paper: https://arxiv.org/abs/1508.07909

BPE already learns morphology because it sees the raw bytes.

by kouteiheika6 hours ago|

parent|

[-]

> That's the pre-tokenizer, not the tokenizer.

Yes, it's an extra tokenizer which runs before the learned tokenizer and injects an inductive bias into it.

> That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.

While it does indeed speed up training of the tokenizer, no, it isn't mostly just a performance optimization? It injects a clear cut inductive bias into the tokenizer (split by words, split by punctuation, don't merge words and numbers, etc. -- is that not an inductive bias?), and for some languages (e.g. Asian languages which don't use spaces) the "it's just for performance" argument doesn't make as much sense because there it has no spaces to split on, so the chunks of text are much longer (although it does still split on punctuation, etc.).

Can we not agree that the absolutist position of "Putting an inductive bias in your tokenizer seems just a terrible idea." (as in - any inductive bias) is not actually true, especially since people are actually doing it?

Note, I'm not actually arguing that hand-crafted morphological tokenizers are better. (Which is the straw man many people seem to be replying to.) I'm just arguing that it should be feasible to train your tokenizer in a more morphologically aware way, because BPE doesn't do that.

> The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. [..] BPE already learns morphology because it sees the raw bytes.

The reason everyone went to BPE is because of the bitter lesson (and because you don't have to hardcode your whole vocabulary, i.e. no UNK tokens), and not because it's particularly good at learning the morphology of the actual text. It's trivial to show countless examples where it fails to do so.

by anonymoushn9 hours ago|

prev|

[-]

their old tokenizer performed some space collapsing that allowed them to use the same token id for a word with and without the leading space (in cases where the context usually implies a space and one is not present, a "no space" symbol is used).