Unicode needs tab, space, form feed, and carriage return.
Unicode needs U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK to switch between left-to-right and right-to-left languages.
Unicode needs U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER to typeset Korean.
Unicode needs U+200C ZERO WIDTH NON-JOINER to encode that two characters should not be connected by a ligature.
Unicode needs U+200B ZERO WIDTH SPACE to indicate a word break opportunity without actually inserting a visible space.
Unicode needs MONGOLIAN FREE VARIATION SELECTORs to encode the traditional Mongolian alphabet.
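For anyone who wants to check what the characters listed above actually are, here is a minimal sketch using Python's standard `unicodedata` module. The `describe` helper and the `CODE_POINTS` list are illustrative names, not part of any library, and U+180B is picked as one representative of the Mongolian free variation selectors.

```python
import unicodedata

# Code points named in the list above (U+180B is one example of a
# Mongolian free variation selector; the choice is illustrative).
CODE_POINTS = [0x200E, 0x200F, 0x115F, 0x1160, 0x200C, 0x200B, 0x180B]

def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, official character name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

if __name__ == "__main__":
    for cp in CODE_POINTS:
        print("  ".join(describe(cp)))
```

The general category column is the useful part: the format characters report `Cf`, which is one way to find them in text without relying on how they render.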
There are also languages that are written from top to bottom.
Unicode is not exclusively for coding; on the contrary, I'm pretty sure coding is only a small fraction of how Unicode is used.
> Somehow people didn't need invisible characters when printing books.
They didn't need computers either so "was seemingly not needed in the past" is not a good argument.
Yes, it is. Unicode has undergone major mission creep, thinking it is now a font language and a formatting language. Naturally, this has led to making it a vector for malicious actors. (The direction reversing thing has been used to insert malicious text that isn't visible to the reader.)
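As a rough illustration of how the direction-reversing characters can at least be detected, here is a sketch that flags explicit bidirectional embedding, override, and isolate controls by their bidi class. `find_bidi_controls` is a hypothetical helper name, and this is one possible check, not a complete defense.

```python
import unicodedata

# Bidi classes of the explicit embedding/override/isolate controls
# (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF",
                        "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, character name) for every explicit bidi control."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

if __name__ == "__main__":
    # U+202E reverses the display order of the text that follows it.
    sample = "access_level = \u202e'resu'"
    print(find_bidi_controls(sample))
```

Note that U+200E and U+200F are not flagged here: they are plain strong-direction marks (bidi classes `L` and `R`), not overrides.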
> Unicode is not exclusively for coding
I never mentioned coding.
> They didn't need computers
Unicode is for characters, not formatting. Formatting is what HTML is for, and many other formatting standards. Neither is it for meaning.
But not one that would surprise anyone familiar with WalterBright's antics on this website…
Look Ma
xt! N !
e tee S
T larip
(No Unicode needed.)
I should be able to use Ü as a cursed smiley in text, and many more writing systems supported by Unicode support even more funny things. That's a good thing.
On the other hand, if technical and display file names (to GUI users) were separate, my need for crazy characters in file names, code bases and such would be very limited. Lower ASCII for actual file names consumed by technical people is sufficient for me.
Sure, but more crazy stuff gets added all the time.
Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.
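Unicode as it stands does not follow this rule even within a single script: "é" can be one code point or two. A small sketch of that case, using the standard `unicodedata.normalize`; the helper name `same_after_nfc` is made up for illustration.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

if __name__ == "__main__":
    print(PRECOMPOSED == DECOMPOSED)                # raw comparison
    print(same_after_nfc(PRECOMPOSED, DECOMPOSED))  # after normalization
```

Normalization only covers sequences that Unicode defines as canonically equivalent. It does nothing for lookalikes from different scripts.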
And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.
> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?
Those Unicode homonyms are a solution looking for a problem.
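To make the disagreement concrete: Latin "A", Greek "Α" and Cyrillic "А" are three separate code points today, and software can tell them apart even when a reader cannot. The sketch below uses the first word of each character's name as a crude script hint. That is an assumption made for brevity, since Python's standard library does not expose the Unicode Script property, and `script_hints` is a hypothetical helper.

```python
import unicodedata

def script_hints(word: str) -> set[str]:
    """Crude script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }

if __name__ == "__main__":
    latin_a, greek_a, cyrillic_a = "A", "\u0391", "\u0410"
    for ch in (latin_a, greek_a, cyrillic_a):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # A word that mixes scripts: the second letter is Cyrillic U+0430.
    print(script_hints("p\u0430ypal"))
```

A mixed result such as Latin plus Cyrillic in one word is the kind of signal a spoofing check could look for, under the assumption that legitimate words rarely mix scripts.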
Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?
And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.
I can absolutely tell Cyrillic к from the Latin k and Latin u from the Cyrillic и.
> should not be about semantic meaning,
It's always better to be able to preserve more information in a text and not less.
They look visually distinct to me. I don't get your point.
> It's always better to be able to preserve more information in a text and not less.
Text should not lose information by printing it and then OCR'ing it.
While at it we could also unify I, | and l. It's too confusing sometimes.
They render differently, so it's not a problem.
Also, this attack doesn't seem to use invisible characters, just characters that don't have an assigned meaning.
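If "no assigned meaning" is read as private-use or unassigned code points (an assumption, since the attack details are not spelled out here), they can be spotted by general category. A minimal sketch; `unassigned_or_private` is an illustrative name.

```python
import unicodedata

# "Co" = private use, "Cn" = unassigned (includes noncharacters).
SUSPECT_CATEGORIES = {"Co", "Cn"}

def unassigned_or_private(text: str) -> list[tuple[int, str, str]]:
    """Return (index, U+XXXX label, category) for suspect code points."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.category(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in SUSPECT_CATEGORIES
    ]

if __name__ == "__main__":
    # U+E000 is the first code point of the Private Use Area.
    print(unassigned_or_private("x = 1\ue000"))
```

Which code points count as unassigned depends on the Unicode version bundled with the Python build, so results for `Cn` can differ between interpreter versions.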
Do you honestly think this is a workable solution?