by entuno12 hours ago |

by jncraton11 hours ago|

You are right that the concept of "safe" is nebulous, but the goal here is specifically to be XSS-safe [1]. Elements or properties that could allow scripts to execute are removed. This functionality lives in the user agent and prevents adding unsafe elements to the DOM itself, so it should be easier to get correct than a string-to-string sanitizer. The logic of "is the element currently being added to the DOM a <script>" is fundamentally easier to get right than "does this HTML string include a script tag".

[1] https://developer.mozilla.org/en-US/docs/Web/API/Element/set...
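To make the idea concrete, here is a minimal sketch of feature-detecting `setHTML` and falling back to plain text where it is missing. The helper name `renderUntrusted` is hypothetical, and it only assumes the element exposes `setHTML` and `textContent`.

```javascript
// Hypothetical helper (not part of any standard API).
// Uses the sanitizing setter when the browser provides it; otherwise
// treats the input as plain text, which never creates elements.
function renderUntrusted(element, untrustedInput) {
  if (typeof element.setHTML === "function") {
    element.setHTML(untrustedInput); // the browser drops script-capable content
    return "sanitized";
  }
  element.textContent = untrustedInput; // inserted as a text node
  return "text";
}
```

The return value is only there so callers (or tests) can tell which path was taken.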

by entuno8 hours ago|

It's certainly an improvement over people trying to homebrew their own sanitisers. But the distinction of being XSS-safe is a potentially subtle one, and it could end up being dangerous if people don't carefully consider whether XSS-safe is good enough when they're handling arbitrary user input like that.

by intrasight8 hours ago|

Also has made me nervous for years that there's been no schema against which one can validate HTML. "You want to validate? Paste your URL into the online validation tool."

by Dylan168074 hours ago|

This help? https://github.com/validator/validator

But for html snippets you can pretty much just check that tags follow a couple simple rules between <> and that they're closed or not closed correctly.

by intrasight4 hours ago|

That app does look helpful!

by Cthulhu_10 hours ago|

Ideally you should be able to set a global property somewhere (as a web developer) that disallows outdated APIs like `innerHTML`, with the Big Caveat that your website will not work on browsers older than X. But maybe there are web standards for that already, such as backup content if a browser is considered outdated.

by cxr5 hours ago|

It's not an "outdated API". It's still good for what it was always meant for: parsing trusted, application-generated markup and atomically inserting it into the content tree as a replacement for a given element's existing children.

> set a global property somewhere (as a web developer) that disallows[…] `innerHTML`

    Object.defineProperty(Element.prototype, "innerHTML", {
      set: (() => { throw Error("No!") })
    });

(Not that you should actually do this—anyone who has to resort to it in their codebase has deeper problems.)

by staticassertion10 hours ago|

Doesn't using TrustedTypes basically do that? I'm not really web-y, someone please correct me if I'm off.

by madeofpalk8 hours ago|

Yup, this is basically what TrustedTypes is for!

by afavour10 hours ago|

I like the idea of that. But I imagine linting rules are a much more immediate answer in a lot of projects.

by voxic1112 hours ago|

The idea is you wouldn't mix innerHTML and setHTML, you would eliminate all usage of innerHTML and use the new setHTMLUnsafe if you needed the old functionality.

by extraduder_ire11 hours ago|

I looked up setHTMLUnsafe on MDN, and it looks like it's been in every notable browser since last year.

Good idea to ship that one first, when it's easier to implement and is going to be the unsafe fallback going forward.

by onion2k9 hours ago|

> I looked up setHTMLUnsafe on MDN, and it looks like it's been in every notable browser since last year.

Oddly though, the Sanitizer API that it's built on doesn't appear to be in Safari. https://developer.mozilla.org/en-US/docs/Web/API/Sanitizer

by croes11 hours ago|

If I need the old functionality why not stick to innerHTML?

by orf11 hours ago|

Because the "unsafe" suffix conveys information to the reader, whereas `innerHTML` does not?

by goatlover10 hours ago|

Any potential reader should be familiar with innerHTML.

by kennywinker9 hours ago|

Right. Like how any potential reader is familiar with the risks of sql injection which is why nothing has ever been hacked that way.

Or how any potential driver is familiar with seat belts which is why everybody wears them and nobody’s been thrown from a car since they were invented.

by orf9 hours ago|

yes, and bugs shouldn't exist because everyone should be familiar with everything.

by croes8 hours ago|

But if some are marked unsafe and others are not it gives a false sense of security if something is not marked unsafe.

by orf8 hours ago|

So we shouldn’t mark anything as unsafe then? And give no indication whatsoever?

The issue isn’t that the word “safe” doesn’t appear in safe variants, it’s that “unsafe” makes your intentions clear: “I know this is unsafe, but it’s fine because of X and Y”.

by croes7 hours ago|

Maybe we should add the word safe and consider everything else as unsafe

by orf6 hours ago|

Like life, things should default to being safe. Unsafe, unexpected behaviours should be the exception and thus require an exceptional name.

Legacy and backwards compatibility hamper this, but going forward…

by tbrownaw11 hours ago|

Because then your linter won't be able to tell you when you're done migrating the calls that can be migrated.

by philipwhiuk9 hours ago|

Because sooner or later it'll be removed.

by goatlover8 hours ago|

No, because the web has to remain backwards compatible with older sites. This has always been the case.

by croes8 hours ago|

And break millions of sites?

by reddalo11 hours ago|

You can't rename an existing method. It would break compatibility with existing websites.

by post-it11 hours ago|

> you would eliminate all usage of innerHTML

The mythical refactor where all deprecated code is replaced with modern code. I'm not sure it has ever happened.

I don't have an alternative of course, adding new methods while keeping the old ones is the only way to edit an append-only standard like the web.

by thenewnewguy11 hours ago|

If you want to adopt this in your project, you can add a linter that explicitly bans innerHTML (and then go fix the issues it finds). Obviously Mozilla cannot magically fix the code of every website on the web but the tools exist for _your_ website.
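One way such a lint rule might look, as a sketch: ESLint's core `no-restricted-syntax` rule with an AST selector that matches assignments to a property named `innerHTML`. The variable name `banInnerHTML` and the message text are made up for the example, and the selector only catches direct dot-notation assignments.

```javascript
// Sketch of one entry for an ESLint flat config (eslint.config.js).
// `no-restricted-syntax` is a core ESLint rule; the selector and message
// below are illustrative and only match `something.innerHTML = ...`.
const banInnerHTML = {
  rules: {
    "no-restricted-syntax": [
      "error",
      {
        selector: "AssignmentExpression[left.property.name='innerHTML']",
        message: "Use setHTML() or textContent instead of assigning innerHTML.",
      },
    ],
  },
};
```

To use it, the object would be included in the array exported from `eslint.config.js`. Bracket access such as `el["innerHTML"] = x` would need an additional selector.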

by Vinnl11 hours ago|

I kinda like the way JS evolved into a modern language, where essentially ~everyone uses a linter that e.g. prevents the use of `var`. Sure, it's technically still in the language, but it's almost never used anymore.

(Assuming transpilers have stopped outputting it, which I'm not confident about.)

by yawaramin7 hours ago|

Actually... https://github.com/microsoft/TypeScript/issues/52924

by Vinnl7 hours ago|

Ah yeah, I remember that. General point still stands: in terms of the lived experience of developers, `var` is essentially deprecated.

by plorkyeran5 hours ago|

I touch JS that uses var heavily on a daily basis and I would be incredibly surprised to find out that I am alone in that.

by delaminator11 hours ago|

for some values of "everyone" and "never".

by thunderfork11 hours ago|

Depending on the transpiler and mode of operation, `var` is sometimes emitted.

For example, esbuild will emit var when targeting ESM, for performance and minification reasons. Because ESM has its own inherent scope barrier, this is fine, but it won't apply the same optimizations when targeting (e.g.) IIFE, because it's not fine in that context.

https://github.com/evanw/esbuild/issues/1301

by bulbar11 hours ago|

It for sure happens for drop in replacements.

by littlestymaar10 hours ago|

Nobody's talking about old code here.

Having an alternative to innerHTML means you can ban it from new code through linting.

by noduerme11 hours ago|

Finally, a good use case for AI.

by Aachen11 hours ago|

Yeah, using a kilowatt GPU for string replacement is going to be the killer feature. I probably shouldn't even be joking, people are using it like this already

by charcircuit11 hours ago|

When the condition for when you want to replace is hard to properly specify, AI shines for such find and replaces.

by Aachen10 hours ago|

This one is literally matching "innerHTML = X" and setting "setHTML(X)" instead. Not some complex data format transformation

But I can see what you mean, even if then it would still be better for it to print the code that does what you want (uses a few Wh) than doing the actual transformation itself (prone to mistakes, injection attacks, and uses however many tokens your input data is)
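For reference, the mechanical version of that replacement could be sketched as below. The function name `rewriteInnerHTML` is hypothetical, and the regex only handles simple single-line assignments ending in a semicolon; it is not a real parser.

```javascript
// Naive codemod sketch: rewrites single-line `target.innerHTML = expr;`
// into `target.setHTML(expr);`. Comparisons (==, ===) and `+=` are left alone.
function rewriteInnerHTML(source) {
  const pattern = /([\w$.]+)\.innerHTML\s*=(?!=)\s*([^;\n]+);/g;
  return source.replace(pattern, "$1.setHTML($2);");
}
```

Note that this changes behaviour, not just spelling: `setHTML` sanitizes its input, so any call site that relied on `innerHTML` inserting scripts or event-handler attributes would need `setHTMLUnsafe` or a manual review instead.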

by charcircuit9 hours ago|

That can break the site if you do the find and replace blindly. The goal here is to do the refactor without breaking the site.

by lelanthran7 hours ago|

> When the condition for when you want to replace is hard to properly specify, AI shines for such find and replaces.

And, in your opinion, this is one of those cases?

by charcircuit3 hours ago|

It is because the new API purposefully blocks things the old API did not.

by littlestymaar10 hours ago|

This ship has sailed, unfortunately. Just yesterday I saw coworkers redact a screenshot using ChatGPT.

by josefx11 hours ago|

Wouldn't AI be trained on data using innerHTML?

by Aachen11 hours ago|

My experience is that they somehow print quite modern code despite things like ES6 being too new to be standard knowledge even for me and I'm not even middle-aged yet

Maybe the last 10 years saw so much more modern code than the last cumulative 40+ years of coding and so modern code is statistically more likely to be output? Or maybe they assign higher weights to more recent commits/sources during training? Not sure but it seems to be good at picking this up. And you can always feed the info into its context window until then

by skeeter202010 hours ago|

This is not my experience. Claude has been happily generating code over the past week that is full of implicit any and using code that's been deprecated for at least 2 years.

>> Maybe the last 10 years saw so much more modern code than the last cumulative 40+ years of coding and so modern code is statistically more likely to be output?

The rate of change has made defining "modern" even more difficult and the timeframe brief, plus all that new code is based on old code, so it's more like a leaning tower than some sort of solid foundation.

by SahAssar9 hours ago|

ES6 is 11 years old. It's not that new.

by chrisweekly8 hours ago|

> "ES6 being too new to be standard knowledge"

Huh? It's been a decade.

by charcircuit11 hours ago|

Which is why it can easily understand how innerHTML is being used so that it can replace it with the right thing.

by stvltvs11 hours ago|

Honest question: Is there a way to get an LLM to stop emitting deprecated code?

by fragmede11 hours ago|

Theoretically, if you could train your own, and remove all references to the deprecated code in the training data, it wouldn't be able to emit deprecated code. Realistically that ability is out of reach at the hobbyist level, so it will have to remain theoretical for at least a few more iterations of Moore's law.

by cxr5 hours ago|

> it's not at all clear which is which from the names. Ideally you design that in from the [start]

It was, and there is: setting elementNode.textContent is safe for untrusted inputs, and setting elementNode.innerHTML is unsafe for untrusted inputs. The former will escape everything, and the latter won't escape anything.
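In the DOM, `textContent` gets this effect by creating a text node rather than by rewriting characters. When markup is built as a string instead (on the server, say), the rough equivalent is escaping. A minimal sketch, with `escapeHtml` as a hypothetical helper name:

```javascript
// String-level analogue of "treat it as text": escape the characters that
// are significant in HTML so the input can only ever render as text.
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}
```

This covers element content and quoted attribute values only; other contexts (URLs, inline scripts, CSS) have their own rules.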

You are right that these "sanitizers" are fundamentally confused:

> "HTML sanitization" is never going to be solved because it's not solvable.¶ There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement.

<https://news.ycombinator.com/item?id=46222923>

The Web platform folks who are responsible for getting fundamental APIs standardized and implemented natively are in a position to know better, and they should know better. This API should not have made it past proposal stage and should not have been added to browsers.

by Dylan168074 hours ago|

> There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement.

It is not a hard requirement that untrusted input is "treated like text". And this API lets you customize exactly what tags/attributes are allowed in the untrusted input. That's way better than telling everyone to write their own; it's not trivial.
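A sketch of what that customization might look like. The option names (`sanitizer`, `elements`, `attributes`) follow my reading of the Sanitizer API draft and should be checked against current browser documentation; the helper name `setCommentHTML` and the specific allowlist are invented for the example.

```javascript
// Sketch only: passes an allowlist-style sanitizer configuration to setHTML.
// Option names are assumptions based on the Sanitizer API draft.
function setCommentHTML(element, untrustedHtml) {
  const options = {
    sanitizer: {
      elements: ["p", "b", "i", "a", "ul", "li", "code"], // tags to keep
      attributes: ["href"], // attributes to keep
    },
  };
  if (typeof element.setHTML !== "function") {
    throw new Error("setHTML is not supported in this environment");
  }
  element.setHTML(untrustedHtml, options);
  return options;
}
```

As I understand the draft, the safe `setHTML` still removes script-capable content even if the configuration tries to allow it; only `setHTMLUnsafe` honours such a configuration as written.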

by cxr2 hours ago|

> It is not a hard requirement that untrusted input is "treated like text".

It's also not a hard requirement that I defend the position that there's a hard requirement for untrusted input to be treated like text. That isn't my position, and it's not what I wrote.

Given that it is not a hard requirement that untrusted input be treated like text, it wouldn't make sense for anyone to claim that it is. It therefore doesn't make sense for someone, presented with what I did write, to strenuously argue that such a tortured, implausible, uncharitable, nonsensical interpretation of what I wrote is something I have to account for (versus the interpretation that does match what I wrote, is actually true, and makes sense).

You are, willfully or not, misconstruing what I have written.

> That's way better than telling everyone to write their own; it's not trivial.

Right, it's not trivial. It's so far the opposite of trivial that it's (as I said the first time—and again, just now) not solvable.

No one should be writing their own.

No one should be trying to write their own.

No one should be using this API at all.

And no one should have pushed for its implementation.

It's a bad API.

by Dylan168071 hours ago|

I thought you were done talking to me?

Briefly though, if you have an untrusted string then you need to either treat it like text or sanitize it. I don't see any other options.

So if people shouldn't use this sanitizer or write their own, then the only option left is treating the string as text. But you're vehemently arguing that's not what you said.

What's the other way to use an untrusted string? Other than "don't", but that means not taking input and only works for toy apps.

by cxr4 hours ago|

[flagged]

by Dylan168074 hours ago|

I don't see how I differed from what you said? You divided strings going into HTML into two categories, where one category uses textContent and the other category uses innerHTML. My point is to disagree with those categories, not whatever subtle thing you're taking issue with.

by cxr4 hours ago|

Oh, okay. Tell me, dipshit, are the following two claims equivalent or different?:

"Everyone who files a tax return should know whether they need to pay at least $1000 in unpaid taxes to the IRS."

"Everyone who files a tax return needs to pay at least $1000 in unpaid taxes to the IRS."

> You divided strings going into HTML into two categories, where one category uses textContent and the other category uses innerHTML.

No, I didn't:

> setting elementNode.textContent is safe for untrusted inputs, and setting elementNode.innerHTML is unsafe for untrusted inputs

That's what I wrote: a statement containing two claims (both true—and not even in the part of my comment that you actually quoted and pretended to be replying to).

by Dylan168073 hours ago|

This is a totally different kind of statement. You're not dividing tax returns into two categories and then saying what to do with each category.

Those claims are different but not in a way that analogizes to the HTML conversation.

by cxr3 hours ago|

I'd say I'm interested in hearing how you reason that knowing whether you need to pay at least $1000 in unpaid taxes to the IRS doesn't put you in one bucket or another, but I'm not.

by Dylan168073 hours ago|

The IRS thing indirectly has categories but it doesn't say what to do with them, and what to do with them is what I disagreed with your original post on. I didn't say all input is untrusted or whatever analogizes to your tax thing.

Anyway, I see you edited your previous post after I wrote my reply.

If you weren't trying to divide things into two categories, you wrote it very confusingly. When you say how to handle trusted strings, then say how to handle untrusted strings, then say "There's no getting around knowing whether or any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement." it really sounds like that's supposed to cover all cases.

Me thinking you were using two categories is an honest mistake, not malicious misquoting.

And reading your original post that way is the interpretation that makes it stronger. If there are more categories then setHTML is no longer "fundamentally confused". Your argument against it falls apart.

by cxr2 hours ago|

Guess how interested I am in pretending that a debate with you—about this or anything else—is worthwhile (or anything, really, other than an even bigger waste of time than it already has been).

by jaffathecake11 hours ago|

fwiw, if you serve your page with:

Content-Security-Policy: require-trusted-types-for 'script'

…then it blocks you from passing regular strings to the methods that don't sanitize.
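With that header in place, strings have to go through a Trusted Types policy before they reach a sink like `innerHTML`. A minimal sketch of registering one, assuming the browser exposes `trustedTypes.createPolicy`; the policy name "escape-only" and the helper name `installEscapePolicy` are made up here.

```javascript
// Sketch: register a Trusted Types policy whose createHTML escapes markup.
// The factory is passed in so the helper can be exercised outside a browser.
function installEscapePolicy(trustedTypesFactory) {
  if (!trustedTypesFactory || typeof trustedTypesFactory.createPolicy !== "function") {
    return null; // Trusted Types not available in this environment
  }
  return trustedTypesFactory.createPolicy("escape-only", {
    createHTML: (input) => input.replace(/&/g, "&amp;").replace(/</g, "&lt;"),
  });
}
```

In a page this would be called as `installEscapePolicy(globalThis.trustedTypes)`, and the result of `policy.createHTML(userInput)` is what gets assigned to the sink.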

by DoctorOW11 hours ago|

They do link the default configuration for "safe": https://wicg.github.io/sanitizer-api/#built-in-safe-default-...

But I agree, my default approach has usually been to only use innerText if it has untrusted content.

So if their demo is this:

    container.setHTML(`<h1>Hello, ${name}</h1>`);

Mine would be:

    let greetingHeader = document.createElement("h1");
    greetingHeader.innerText = `Hello, ${name}`;
    container.append(greetingHeader);

by itishappy11 hours ago|

What if I wanted an <h2>?

Edit: I don't mean this flippantly. If I want to render, say, my blog entry on your site, will I need to select every markup element from a dropdown list of custom elements that only accept text a la Wordpress?

by DoctorOW3 hours ago|

If it's anything complex I'm doing it server side, personally

by HWR_148 hours ago|

That's why I only allow user input of alphanumeric ASCII characters. No need to worry about sanitization then, and you can just remove all the characters that don't match.

(It's a joke, but it is also 100% XSS, SQL injection, etc. safe and future proof)
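For completeness, the filter being joked about is a one-liner. The function name `keepAlphanumeric` is just a label for the example.

```javascript
// Drops every character that is not an ASCII letter or digit.
// Spaces and punctuation are removed too, which is why this is a joke
// rather than a practical policy for free-form text.
function keepAlphanumeric(input) {
  return String(input).replace(/[^a-zA-Z0-9]/g, "");
}
```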

by noduerme11 hours ago|

Some sanitization is better than none? If you're relying on the browser to handle it for you, you're already in a lot of trouble.

by thaumasiotes5 hours ago|

> I'm also rather sceptical of things that "sanitise" HTML, both because there's a long history of them having holes, and because it's not immediately clear what that means, and what exactly is considered "safe".

What is safe depends on where the sanitized HTML is going, on what you're doing with it.

It isn't possible to "sanitize HTML" after collecting it so that, when you use it in the future, it will be safe. "Safe" is defined by the use.

But it is possible to sanitize it before using it, when you know what the use will be.

by post-it11 hours ago|

realSetSafeHTML()

by pornel11 hours ago|

BTW, HTML allows inline SVG with an XML-flavored syntax that interprets <script/> and <title> differently. It's a goldmine for sanitizer escapes. There are completely bonkers syntax switching and error recovery rules that interact with parsing modes (there's even an edge case where a particular attribute value switches between HTML and XML-ish parsing rules).

Don't even try to allow inline <svg> from untrusted sources! (and then you still must sanitise any svg files you host)

by kccqzy10 hours ago|

If you just serve SVGs through <img> tag it’ll be much safer. I never understood the appeal of inline <svg> anyways.

by lenkite8 hours ago|

Inline SVG is stylable with CSS styles in the same HTML page.

by runarberg6 hours ago|

Also animatable with the same context (Animation API, etc.) as the parent page, so different SVGs can influence each other’s animations.

by rwj9 hours ago|

Inline reduces round trips.

by toast08 hours ago|

You can use img with a data url?

by cxr5 hours ago|

It may be using some of the same deserialization machinery, but "parsing" is a broad term that includes things that the sanitizer is doing and that the browser's ordinary content-processing → rendering path does not.

Even with this being a native API, there are still two parsers that need to be maintained. What a native API achieves is to shift the onus for maintaining synchronicity between the two onto the browser makers. That's not nothing, but it's also not the sort of free lunch that some people naively believe it is.

by onion2k9 hours ago|

> it's not at all clear which is which from the names

There's setHTML and setHTMLUnsafe. That seems about as clear as you can get.

by entuno8 hours ago|

If that'd been the design from the start, then sure. But it's not at all obvious that setHTML is safe with arbitrary user input (for a given value of "safe") and innerHTML is dangerous.

by hahn-kev9 hours ago|

But you can use innerHTML to set HTML and that's not safe.

by onion2k8 hours ago|

At this point that API has been around for decades and is probably impossible to deprecate without breaking fairly large amounts of the web. The only option is to introduce a new and better API, and maybe eventually have the browser throw out console warnings if a page still uses the old innerHTML API. I doubt any browser vendor will be gung ho enough to actually remove it for a very long time.
