Haven't watched the videos yet, but from the slides, it looks like part of the issue he was talking about was encodings (there's a slide illustrating UTF-16LE vs UTF-16BE, for example). Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again. It has a much larger set of valid characters, but if you receive a text file without knowing its encoding, you can just assume it's UTF-8 and have a 99.7% chance of being right.

FINALLY.
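
The "just assume it's UTF-8" heuristic might look something like the sketch below. It is only an illustration, not a real encoding detector: `decode_text` and its `fallback` parameter are hypothetical names, and using cp1252 as the fallback is an assumption made for the example. The approach leans on the fact that UTF-8 has strict byte-sequence rules, so legacy single-byte text containing non-ASCII characters usually fails to validate as UTF-8.

```python
def decode_text(data: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 first.

    `fallback` is an assumed legacy encoding for this sketch; a real
    application would pick it from context (locale, file metadata, etc.).
    """
    try:
        # Strict decoding: invalid UTF-8 byte sequences raise an error.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the assumed legacy code page.
        return data.decode(fallback, errors="replace"), fallback


sample = "na\u00efve caf\u00e9"  # "naïve café"
print(decode_text(sample.encode("utf-8")))
print(decode_text(sample.encode("cp1252")))
```

Pure-ASCII input decodes identically either way, which is part of why the guess is so often right.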

by bmitc, 1 hour ago

The point is, a lot of work went into making that happen. I.e., plain text as it is today is not some inherent property of computing. It is a binary protocol and displaying text through fonts is also not a trivial matter.

So my question is: what are we leaving on the table by overfocusing on text? What about graphs and visual elements?

by ButlerianJihad, 1 hour ago

vaxocentrism, or “All the World’s a VAX”

http://www.catb.org/esr/jargon/html/V/vaxocentrism.html

by thaumasiotes, 4 hours ago

> Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again.

Whenever I hear this, I hear "all text files should be 50% larger for no reason".

UTF-8 is pretty similar to the old code page system.

by mort96, 4 hours ago

Hm? UTF-8 encodes all of ASCII with one byte per character, and is pretty efficient for everything else. I think the only advantage UTF-16 has over UTF-8 is that some ranges (such as Han characters I believe?) are often 3 bytes of UTF-8 while they're 2 bytes of UTF-16. Is that your use case? Seems weird to describe that as "all text files" though?
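
The size tradeoff described here is easy to check directly. The sketch below uses a hypothetical helper, `encoded_sizes`, and sample strings chosen purely for illustration; it counts bytes for UTF-8 and for UTF-16 without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Count code points and encoded bytes for one string."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" writes no byte order mark, so only the text is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Illustrative samples only; real documents mix scripts, digits, and markup.
samples = {
    "English": "plain text",
    "French": "d\u00e9j\u00e0 vu",
    "Russian": "простой текст",
    "Japanese": "プレーンテキスト",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
```

ASCII stays at one byte per character in UTF-8 and doubles in UTF-16, while kana and Han characters take three bytes in UTF-8 and two in UTF-16.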

by thaumasiotes, 4 hours ago

UTF-8 encodes European glyphs in two bytes and oriental glyphs in three bytes. This is due to the assumption that you're not going to be using oriental glyphs. If you are going to use them, UTF-8 is a very poor choice.

by mort96, 3 hours ago

UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?

by thaumasiotes, 3 hours ago

> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.

> Most European languages use variations of the latin alphabet

If you want to interpret "variations of Latin" really, really loosely, that's true.

Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.
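
Both sides of this tradeoff can be seen in a few lines. The sketch below assumes cp1251 and ISO 8859-7 as representative single-byte code pages for Cyrillic and Greek; `can_encode` is a hypothetical helper, and the sample words are illustrative.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cyrillic = "Привет"
greek = "Γεια"

# One byte per letter in a matching code page, two bytes per letter in UTF-8.
print(len(cyrillic.encode("cp1251")), len(cyrillic.encode("utf-8")))
print(len(greek.encode("iso8859_7")), len(greek.encode("utf-8")))

# Mixing scripts is where single-byte code pages stop working.
mixed = cyrillic + " " + greek
for encoding in ("cp1251", "iso8859_7", "utf-8"):
    print(encoding, can_encode(mixed, encoding))
```

The code pages halve the size for single-script text, but neither one can represent the mixed string, and the resulting bytes carry no indication of which code page produced them.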

by harmonics, 3 hours ago

As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...

Maybe if you're one of those AI behemoths who works with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks which all fit in one byte in UTF-8).

On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.

It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.

I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.
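
The broken text described above is easy to reproduce: the same bytes simply mean different characters under different code pages. In this sketch, `decode_with_guesses` is a hypothetical helper and the list of guessed encodings is an arbitrary illustrative selection.

```python
def decode_with_guesses(raw: bytes, guesses: list) -> dict:
    """Decode the same bytes under several candidate encodings."""
    return {name: raw.decode(name) for name in guesses}


original = "Привет"
raw = original.encode("cp1251")  # written by a program using Windows-1251

# Only one guess matches the encoding that was actually used.
results = decode_with_guesses(raw, ["cp1251", "koi8_r", "cp866", "latin-1"])
for name, text in results.items():
    marker = "ok" if text == original else "garbled"
    print(f"{name:8} {text!r} ({marker})")
```

Every wrong guess decodes without raising an error, which is why a reader had to recognize the right encoding from the shapes of the glyphs.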

by mort96, 3 hours ago

UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.

The byte order mark has no relation to code pages.
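
A short sketch of the byte order mark point, using Python's standard `codecs` constants; the two-letter sample string and the variable names are arbitrary choices for the example.

```python
import codecs

text = "hi"

# UTF-16 has two possible byte orders, so the same text has two byte forms.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le, be)

# The BOM is the code point U+FEFF written first, so a reader can tell which
# byte order follows.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
print(from_be, from_le)

# UTF-8 is a sequence of single bytes with one fixed order: no BOM is needed.
plain = text.encode("utf-8")
with_signature = text.encode("utf-8-sig")  # optional signature, not required
print(plain, with_signature)
```

The BOM identifies byte order within a Unicode encoding; it does not select a character repertoire the way a code page does.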

I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.

EDIT: okay, since you edited your comment to add the part about Greek and Cyrillic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and Cyrillic are exceptions. That does not invalidate my point.

by lelanthran, 4 hours ago

I can't tell what the argument is just from the slideshow. The main point appears to be that code pages, UTF-16, etc are all "plain text" but not really.

If that really was the argument, then it is, in 2026, obsolete; UTF-8 is everywhere.

by benj111, 4 hours ago

He has a YouTube channel, there's a talk on there.

He also discusses code pages etc.

I don't think the thesis is wrong. E.g., when I think plain text I think ASCII, so we're already disagreeing about what 'plain text' is. His point isn't that we don't have a standard, it's that we've had multiple standards over what we think is the most basic of formats, with lots of hidden complications.

by zahlman, 1 hour ago

Nice. I've used the phrase before, with the vague notion that a proper talk must already exist.

by carra, 3 hours ago

I read that article a long time ago, and for me it's a hard disagree. A system as complex and quirky as Unicode can never be considered "plain", and even today it is common for something Unicode-related to break in many apps. ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

And yes, ASCII means mostly limiting things to English, but for many environments that's almost expected. I would defend this even though I'm not a native English speaker myself.

by d-us-vb, 1 hour ago

I feel like that isn’t exactly a very useful definition of plaintext. If you mean “ASCII” say ASCII.

Plain text is text intended to be interpreted as bytes that map simply to characters. Complexity is irrelevant.