by WalterBright 4 hours ago

by pvillano 4 hours ago

Unicode is "designed to support the use of text in all of the world's writing systems that can be digitized"

Unicode needs tab, space, form feed, and carriage return.

Unicode needs U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK to switch between left-to-right and right-to-left languages.

Unicode needs U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER to typeset Korean.

Unicode needs U+200C ZERO WIDTH NON-JOINER to encode that two characters should not be connected by a ligature.

Unicode needs U+200B ZERO WIDTH SPACE to indicate a word break opportunity without actually inserting a visible space.

Unicode needs MONGOLIAN FREE VARIATION SELECTORs to encode the traditional Mongolian alphabet.
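The characters listed above can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the `describe` helper and the `CHARS` list are illustrative names, and U+180B is used as one representative of the Mongolian free variation selectors.

```python
import unicodedata

# Characters named in the comment above. U+180B is the first of the
# Mongolian free variation selectors, chosen here as a representative.
CHARS = ["\t", " ", "\u200e", "\u200f", "\u115f", "\u1160",
         "\u200c", "\u200b", "\u180b"]

def describe(ch):
    """Return (code point, name, general category) for one character."""
    # Control characters such as tab have no formal name, so pass a fallback.
    name = unicodedata.name(ch, "<control>")
    return f"U+{ord(ch):04X}", name, unicodedata.category(ch)

for ch in CHARS:
    print(*describe(ch))
```

The general category in the last column (for example `Cc` for controls or `Cf` for format characters) is how the standard itself separates these from ordinary letters.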

by WalterBright 3 hours ago

[flagged]

by bulbar 3 hours ago

That's a very narrow view of the world. One example: in the past I have handled bilingual English-Arabic files with direction switches within the same line, since Arabic is written from right to left.

There are also languages that are written from top to bottom.

Unicode is not exclusively for coding. On the contrary, I'm pretty sure coding is only a small fraction of how Unicode is used.

> Somehow people didn't need invisible characters when printing books.

They didn't need computers either so "was seemingly not needed in the past" is not a good argument.

by WalterBright 54 minutes ago

> That's a very narrow view of the world.

Yes, it is. Unicode has undergone major mission creep, thinking it is now a font language and a formatting language. Naturally, this has led to making it a vector for malicious actors. (The direction-reversing thing has been used to insert malicious text that isn't visible to the reader.)

> Unicode is not exclusively for coding

I never mentioned coding.

> They didn't need computers

Unicode is for characters, not formatting. Formatting is what HTML is for, and many other formatting standards. Neither is it for meaning.
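The direction-reversing trick mentioned above relies on explicit bidirectional formatting characters. A minimal sketch of how a reviewer might flag them follows; the function name `find_bidi_controls` is hypothetical, and the set covers only the embedding, override, and isolate characters (U+202A to U+202E and U+2066 to U+2069), not every invisible code point.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings and overrides
# (U+202A..U+202E) plus isolates (U+2066..U+2069).
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F),
                                    *range(0x2066, 0x206A))}

def find_bidi_controls(text):
    """Return (index, character name) for each bidi control found in text."""
    return [(i, unicodedata.name(ch))
            for i, ch in enumerate(text)
            if ch in BIDI_CONTROLS]

# A string with a hidden RIGHT-TO-LEFT OVERRIDE after "abc".
print(find_bidi_controls("abc\u202edef"))
```

A check like this reports where the characters are without deciding whether they are legitimate, which matters because bilingual text can use them for valid reasons.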

by pibaker 1 hour ago

> That's a very narrow view of the world.

But not one that would surprise anyone familiar with WalterBright's antics on this website…

by WalterBright 7 minutes ago

At least my antics do not include insulting people.

by jmusall 2 hours ago

The fact is that there were so many character sets in use before Unicode because all these things were needed or at least wanted by a lot of people. Here's a great blog post by Nikita Prokopov about it: https://tonsky.me/blog/unicode/

by WalterBright 3 hours ago

    Look Ma
    xt! N !
    e tee S
    T larip

(No Unicode needed.)

by chongli 2 hours ago

Unicode is for human beings, not machines.

by WalterBright 51 minutes ago

How does invisible Unicode text fit into that?

by chongli 49 seconds ago

It's not text, it's control characters, which have always been in character sets going back to ASCII.
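As a quick check of the point about ASCII, the control characters in the 7-bit range can be counted by general category. This is a sketch using only the standard library; `ascii_controls` is an illustrative name.

```python
import unicodedata

# Code points 0..127 whose general category is Cc (control).
ascii_controls = [cp for cp in range(128)
                  if unicodedata.category(chr(cp)) == "Cc"]

print(len(ascii_controls))  # 33: 0x00-0x1F plus 0x7F (DEL)
```

Space (0x20) is not in that list: it has category `Zs` (space separator), so it is invisible without being a control character.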

by luke-stanley 3 hours ago

So we have a "need a new standard" problem due to the complexity of the last standard? Isn't Unicode supposed to be a superset of ASCII, which already has control characters like space, CR, and newline? xD

by WalterBright 3 hours ago

The only ones people use any more are newline and space. A tab key is fine in your editor, but it's been more or less abandoned as a character. I haven't used a form feed character since the 1970s.

by tetha 2 hours ago

That ship has sailed, but I consider Unicode a good thing, yet I consider it problematic to support Unicode in every domain.

I should be able to use Ü as a cursed smiley in text, and many more writing systems supported by Unicode support even more funny things. That's a good thing.

On the other hand, if technical file names and display file names (shown to GUI users) were separate, my need for crazy characters in file names, code bases and such would be very limited. Lower ASCII for actual file names consumed by technical people is sufficient for me.

by WalterBright 49 minutes ago

> That ship has sailed

Sure, but more crazy stuff gets added all the time.

by WalterBright 4 hours ago

Another dum dum Unicode idea is having multiple code points with identical glyphs.

Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.
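One concrete case where the rule of thumb above does not hold today is precomposed versus combining sequences. The sketch below, with the illustrative helper `same_after_nfc`, shows that "é" can be one code point or two, and that normalization is the standard's way of treating them as equivalent.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def same_after_nfc(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)                # False: different code points
print(same_after_nfc(precomposed, decomposed))  # True: canonically equivalent
```

Normalization covers canonically equivalent sequences only; it does not merge lookalikes from different scripts.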

by estebank 2 hours ago

If anything, Unicode should have had more disambiguated characters. Han unification was a mistake, and lower case dotted Turkish i and upper case dotless Turkish I should exist so that toUpper and toLower didn't need to know/guess at a locale to work correctly.
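The Turkish case can be seen with Python's built-in case mapping, which is locale-independent. This sketch only demonstrates the default mappings; it does not show a locale-aware fix, which would need a library outside the standard `str` methods.

```python
# Default (locale-independent) case mappings for the four Turkish i's.
dotted_upper = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_lower = "\u0131"   # LATIN SMALL LETTER DOTLESS I

print("I".lower())                # 'i'  (Turkish expects dotless U+0131)
print("i".upper())                # 'I'  (Turkish expects dotted U+0130)
print(len(dotted_upper.lower()))  # 2: 'i' plus COMBINING DOT ABOVE (U+0307)
print(dotless_lower.upper())      # 'I'
```

Because plain `I` and `i` are shared with every other Latin-script language, the correct Turkish result cannot be chosen from the code points alone, which is the guessing the comment objects to.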

by WalterBright 45 minutes ago

Characters should not have invisible semantics.

by nswango 4 hours ago

So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
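The three capital letters in question can be compared directly. The sketch below (the `LOOKALIKES` list is an illustrative name) prints each one's name and its lowercase form, one place where the separate code points behave differently even though the capitals look the same.

```python
import unicodedata

# Latin, Greek and Cyrillic capitals that usually render identically.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

for ch in LOOKALIKES:
    # The lowercase forms differ: Latin a, Greek alpha, Cyrillic a.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          f"U+{ord(ch.lower()):04X}")
```

If Greek words used the Latin A, lowercasing them would produce a Latin "a" in the middle of Greek text instead of an alpha.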

by WalterBright 3 hours ago

> So you think that the letters in the Greek and Cyrillic alphabets which are printed identically to the Latin A should not exist?

Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?

Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

Those Unicode homonyms are a solution looking for a problem.

by bawolff 1 hour ago

> Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.

Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?

by WalterBright 44 minutes ago

They are visually distinct to the reader.

by Yokohiii 2 hours ago

Unicode is about semantics not appearance. If you don't need semantics then use something different.

by WalterBright 42 minutes ago

> Unicode is about semantics not appearance.

And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.

by Muromec 2 hours ago

>Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?

I can absolutely tell Latin k from the Cyrillic к and Latin u from the Cyrillic и.

>should not be about semantic meaning,

It's always better to be able to preserve more information in a text and not less.

by WalterBright 2 minutes ago

> I can absolutely tell Latin k from the Cyrillic к and Latin u from the Cyrillic и.

They look visually distinct to me. I don't get your point.

> It's always better to be able to preserve more information in a text and not less.

Text should not lose information by printing it and then OCR'ing it.

by Yokohiii 2 hours ago

What about numbers? Would they be assigned to Arabic only? I guess someone will be offended by that.

While at it we could also unify I, | and l. It's too confusing sometimes.

by WalterBright 19 seconds ago

> While at it we could also unify I, | and l. It's too confusing sometimes.

They render differently, so it's not a problem.

by wcoenen 2 hours ago

As far as I know, glyphs are determined by the font and rendering engine. They're not in the Unicode standard.

by WalterBright 10 minutes ago

Fraktur (font) and italic (rendering) are in the Unicode standard, although Hacker News will not render them.
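The Fraktur and italic letters referred to above live in the Mathematical Alphanumeric Symbols block. A minimal sketch, assuming the capital A from each alphabet as the example, shows that compatibility normalization (NFKC) folds them back to the plain letter.

```python
import unicodedata

fraktur_a = "\U0001D504"  # MATHEMATICAL FRAKTUR CAPITAL A
italic_a = "\U0001D434"   # MATHEMATICAL ITALIC CAPITAL A

for ch in (fraktur_a, italic_a):
    # NFKC applies the <font> compatibility mapping back to plain 'A'.
    print(unicodedata.name(ch), "->", unicodedata.normalize("NFKC", ch))
```

The names show the standard's stated purpose for these code points: they are mathematical symbols, where the style of a letter carries meaning, and not a general styling mechanism.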

by jeltz 4 hours ago

I don't think that would help much. There are also characters which are similar but not the same, and I don't think humans can spot the differences unless they are actively looking for them, which most of the time people are not. If only one of two similar glyphs appears in the text, nobody would likely notice; expectation bias will fuck you over.

by WalterBright 3 hours ago

I wonder how anybody got by with printed books.

by eviks 2 hours ago

So you'd remove space and tab from Unicode?

by [deleted] 3 hours ago

deleted

by moritzruth 4 hours ago

greatidea,whoneedsspacesanyway

by WalterBright 4 hours ago

Spaces appear on a printout.

by bawolff 1 hour ago

Good luck with that, given there are invisible characters in ASCII.

Also, this attack doesn't seem to use invisible characters, just characters that don't have an assigned meaning.

by abujazar 4 hours ago

Invisible characters are there for visible characters to be printed correctly...

by WalterBright 3 hours ago

I'll grant that a space and a newline are necessary. The rest, nope.

by abujazar 2 hours ago

You're talking about a subset of ASCII then. Unicode is supposed to support different languages and advanced typography, for which those characters are necessary. You can't write e.g. Arabic or Hebrew without those "unnecessary" invisible characters.

by WalterBright 18 minutes ago

Please explain why an invisible zero width "character" is necessary.

by uhoh-itsmaciek 4 hours ago

>Remove them from Unicode.

Do you honestly think this is a workable solution?

by WalterBright 3 hours ago

Yes, absolutely. See my other replies.