undefined

points

[-]

The page is declared as ISO 8859-1, but the actual bytes of the text appear to be UTF-8. In UTF-8, characters from U+0080 to U+00AF happen to be encoded as C2 <codepoint value>. For example, U+0092 is encoded as C2 92.

C2 in ISO 8859-1 is ”Â”. U+0092 is the control code Private Use 2 in Unicode, and 92 is the same in ISO 8859-1. However, the standard Western Windows code page 1252 extends ISO 8859-1 by assigning “’” (right single quotation mark) to 92.

HTML5/WHATWG requires an ISO 8859-1 charset declaration to be interpreted as Windows-1252 (https://blog.whatwg.org/the-road-to-html-5-character-encodin...), hence the displayed result is “Â’”.

The original Windows-1252 content must have previously been converted to UTF-8 under the assumption that the source is ISO 8859-1, i.e. mapping 92 to U+0092 (Private Use 2) instead of to U+2019 (Right Single Quotation Mark). The resulting UTF-8 encoding was placed in the web page, which however is declared as ISO 8859-1.

by wvbdmp1 hours ago|

parent|

[-]

Delicious, thank you!

by layer81 hours ago|

parent|

[-]

I edited my post after verifying the actual bytes, it turned out to be slightly more complicated.

by well_actulily1 hours ago|

parent|

[-]

The double-encoding path gets you there too: the original UTF-8 \xE2 \x80 \x99 mis-decoded as iso-8859-1 or Windows-1252 and saved back as UTF-8 gives \xC3 \xA2 \xC2 \x80 \xC2 \x99, which in Windows-1252 renders as Ã¢Â€Â™. A WYSIWYG cleanup replacing that mojibake with the Windows-1252 ' (byte 0x92) and saving back as UTF-8 gets you to \xC2 \x92 on disk.

Edit: Although maybe that's not the most parsimonious explanation.

by root-parent1 hours ago|

parent|

prev|

[-]

this one does g11n....

by 1 hours ago|

prev|

[-]

deleted

by netsharc2 hours ago|

prev|

[-]

They're probably Microsoft's "Smart Quotes", which are Unicode. They were presumably stored in UTF-8 but retrieved as ASCII (or ISO-8859-1).