(hacks.mozilla.org)
I'm also rather sceptical of things that "sanitise" HTML, both because there's a long history of them having holes, and because it's not immediately clear what that means, and what exactly is considered "safe".
[1] https://developer.mozilla.org/en-US/docs/Web/API/Element/set...
But for html snippets you can pretty much just check that tags follow a couple simple rules between <> and that they're closed or not closed correctly.
Don't even try to allow inline <svg> from untrusted sources! (and then you still must sanitise any svg files you host)
Even with this being a native API, there are still two parsers that need to be maintained. What a native API achieves is to shift the onus for maintaining synchronicity between the two onto the browser makers. That's not nothing, but it's also not the sort of free lunch that some people naively believe it is.
It was, and there is: setting elementNode.textContent is safe for untrusted inputs, and setting elementNode.innerHTML is unsafe for untrusted inputs. The former will escape everything, and the latter won't escape anything.
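A minimal sketch of that distinction outside the DOM, assuming a hypothetical `html` tagged-template helper (not a browser API): the literal parts of the template are treated as trusted markup written by the developer, while every interpolated value is treated as untrusted text and escaped. Strictly speaking `textContent` creates a text node rather than escaping anything, but the visible effect is the same.

```javascript
// Escape the characters that are significant to the HTML parser.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical helper: literal template parts are trusted markup,
// interpolated values are untrusted text and get escaped.
function html(strings, ...values) {
  // With no initial value, reduce starts from strings[0] and i begins at 1.
  return strings.reduce(
    (out, literal, i) => out + escapeHtml(values[i - 1]) + literal
  );
}

const userName = '<img src=x onerror="alert(1)">';
const markup = html`<h1>Hello, ${userName}</h1>`;
console.log(markup);
```

The point of the sketch is that the trusted/untrusted decision is made where the string is written, not guessed afterwards from its contents.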
You are right that these "sanitizers" are fundamentally confused:
> "HTML sanitization" is never going to be solved because it's not solvable. There's no getting around knowing whether any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement.
<https://news.ycombinator.com/item?id=46222923>
The Web platform folks who are responsible for getting fundamental APIs standardized and implemented natively are in a position to know better, and they should know better. This API should not have made it past proposal stage and should not have been added to browsers.
It is not a hard requirement that untrusted input is "treated like text". And this API lets you customize exactly what tags/attributes are allowed in the untrusted input. That's way better than telling everyone to write their own; it's not trivial.
"Everyone who files a tax return should know whether they need to pay at least $1000 in unpaid taxes to the IRS."
"Everyone who files a tax return needs to pay at least $1000 in unpaid taxes to the IRS."
> You divided strings going into HTML into two categories, where one category uses textContent and the other category uses innerHTML.
No, I didn't:
> setting elementNode.textContent is safe for untrusted inputs, and setting elementNode.innerHTML is unsafe for untrusted inputs
That's what I wrote: a statement containing two claims (both true—and not even in the part of my comment that you actually quoted and pretended to be replying to).
Those claims are different but not in a way that analogizes to the HTML conversation.
Anyway, I see you edited your previous post after I wrote my reply.
If you weren't trying to divide things into two categories, you wrote it very confusingly. When you say how to handle trusted strings, then say how to handle untrusted strings, then say "There's no getting around knowing whether any arbitrary string is legitimate markup from a trusted source or some untrusted input that needs to be treated like text. This is a hard requirement." it really sounds like that's supposed to cover all cases.
Me thinking you were using two categories is an honest mistake, not malicious misquoting.
And reading your original post that way is the interpretation that makes it stronger. If there are more categories then setHTML is no longer "fundamentally confused". Your argument against it falls apart.
> set a global property somewhere (as a web developer) that disallows[…] `innerHTML`
Object.defineProperty(Element.prototype, "innerHTML", {
set: (() => { throw Error("No!") })
});
(Not that you should actually do this; anyone who has to resort to it in their codebase has deeper problems.)
Good idea to ship that one first, when it's easier to implement and is going to be the unsafe fallback going forward.
Oddly though, the Sanitizer API that it's built on doesn't appear to be in Safari. https://developer.mozilla.org/en-US/docs/Web/API/Sanitizer
Or how any potential driver is familiar with seat belts which is why everybody wears them and nobody’s been thrown from a car since they were invented.
The issue isn’t that the word “safe” doesn’t appear in safe variants, it’s that “unsafe” makes your intentions clear: “I know this is unsafe, but it’s fine because of X and Y”.
Legacy and backwards compatibility hampers this, but going forward…
The mythical refactor where all deprecated code is replaced with modern code. I'm not sure it has ever happened.
I don't have an alternative of course, adding new methods while keeping the old ones is the only way to edit an append-only standard like the web.
(Assuming transpilers have stopped outputting it, which I'm not confident about.)
For example, esbuild will emit var when targeting ESM, for performance and minification reasons. Because ESM has its own inherent scope barrier, this is fine, but it won't apply the same optimizations when targeting (e.g.) IIFE, because it's not fine in that context.
Having an alternative to innerHTML means you can ban it from new code through linting.
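One way such a lint ban could look, sketched as an ESLint flat-config entry using the built-in `no-restricted-syntax` rule. The object name and the message text are illustrative, and the selector only catches direct `x.innerHTML = ...` style assignments (including `+=`), not computed access such as `x["innerHTML"]`.

```javascript
// Entry intended for the array exported from eslint.config.js.
// The variable name and message wording are assumptions for illustration.
const banInnerHtml = {
  rules: {
    "no-restricted-syntax": [
      "error",
      {
        // Matches assignments whose left side is a `.innerHTML` member access.
        selector: "AssignmentExpression[left.property.name='innerHTML']",
        message: "Assigning innerHTML is banned here; use textContent or setHTML().",
      },
    ],
  },
};

console.log(JSON.stringify(banInnerHtml, null, 2));
```

Existing files can be excluded or given a lower severity, so the rule applies to new code without forcing a rewrite of old code.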
But I can see what you mean, even if then it would still be better for it to print the code that does what you want (uses a few Wh) than doing the actual transformation itself (prone to mistakes, injection attacks, and uses however many tokens your input data is)
And, in your opinion, this is one of those cases?
Maybe the last 10 years saw so much more modern code than the last cumulative 40+ years of coding and so modern code is statistically more likely to be output? Or maybe they assign higher weights to more recent commits/sources during training? Not sure but it seems to be good at picking this up. And you can always feed the info into its context window until then
>> Maybe the last 10 years saw so much more modern code than the last cumulative 40+ years of coding and so modern code is statistically more likely to be output?
The rate of change has made defining "modern" even more difficult and the timeframe brief, plus all that new code is based on old code, so it's more like a leaning tower than some sort of solid foundation.
Huh? It's been a decade.
Content-Security-Policy: require-trusted-types-for 'script'
…then it blocks you from passing regular strings to the methods that don't sanitize.
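A rough sketch of how code might cooperate with that header, assuming a hypothetical policy name `app-html` and a placeholder escaper standing in for whatever vetted sanitizer a real policy would call. `trustedTypes.createPolicy` only exists in browsers that implement Trusted Types, so the sketch falls back to a plain object with the same `createHTML` method elsewhere.

```javascript
// Placeholder only: a real policy would call a vetted sanitizer here.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper; "app-html" is an assumed policy name.
function createHtmlPolicy() {
  const rules = { createHTML: (input) => escapeHtml(input) };
  if (typeof trustedTypes !== "undefined") {
    // Browser with Trusted Types: createHTML now returns TrustedHTML objects.
    return trustedTypes.createPolicy("app-html", rules);
  }
  // Elsewhere: same method, plain string result.
  return rules;
}

const policy = createHtmlPolicy();
const vetted = policy.createHTML("<b>hi</b>");
console.log(String(vetted));
```

With the header enforced and no default policy defined, assigning a plain string to `innerHTML` is rejected, while a value produced by the policy is accepted.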
But I agree, my default approach has usually been to only use innerText if it has untrusted content:
So if their demo is this:
container.setHTML(`<h1>Hello, ${name}</h1>`);
Mine would be:
let greetingHeader = document.createElement("h1");
greetingHeader.innerText = `Hello, ${name}`;
container.append(greetingHeader);
Edit: I don't mean this flippantly. If I want to render, say, my blog entry on your site, will I need to select every markup element from a dropdown list of custom elements that only accept text a la Wordpress?
(It's a joke, but it is also 100% XSS, SQL injection, etc. safe and future proof)
What is safe depends on where the sanitized HTML is going, on what you're doing with it.
It isn't possible to "sanitize HTML" after collecting it so that, when you use it in the future, it will be safe. "Safe" is defined by the use.
But it is possible to sanitize it before using it, when you know what the use will be.
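To illustrate "safe is defined by the use", here is a small sketch in which the same untrusted string is encoded differently depending on the sink it lands in. The `profileLink` helper and the `/users?name=` route are made up for the example.

```javascript
// Minimal escaper for element text and double-quoted attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: one untrusted value, two different sinks.
function profileLink(userName) {
  const query = encodeURIComponent(userName); // sink: URL query component
  const label = escapeHtml(userName);         // sink: element text
  return `<a href="/users?name=${query}">${label}</a>`;
}

const link = profileLink('"><script>alert(1)</script>');
console.log(link);
```

Neither encoding is "the sanitized version" of the name: the URL-encoded form would display as gibberish in the link text, and the HTML-escaped form would be the wrong value inside the query string.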
There's setHTML and setHTMLUnsafe. That seems about as clear as you can get.
Preventing one bug class (script execution) is good, but this still allows arbitrary markup to the page (even <style> CSS rules) if I'm reading the docs correctly. You could give Paypal a fresh look for anyone who opens your profile page, if they use this. Who would ever want this?
The main case I can think of is wanting some forum functionality. Perhaps you want to allow your users to be able to write in markdown. This would provide an extra layer of protection as you could take the HTML generated from the markdown and further lock it down to only an allowed set of elements like `h1`. Just in case someone tried some of the markdown escape hatches that you didn't expect.
I think this might be the answer. There's no point to it by itself (either you separate data and code or you don't and let the user do anything to your page), but if you're already using a sanitiser and you can't use `textContent` because (such as with Markdown) there'll be HTML tags in the output, then this could be extra hardening. Thanks!
You still need innerHTML when you want to inject HTML tags in the page, and you could already use innerText when you didn't want to.
Having something in between is seriously useless.
What makes you say this?
If you mean to convey that it's possible to configure it to filter properly, let me introduce you to `textContent` which is older than Firefox (I'm struggling to find a date it's so old)
How would I set a header level using textContent?
document.createElement("h1").textContent = `Hello, ${username}!`
If you allow <h1> in the setHTML configuration or use the default, users with the tag in their username also always get it rendered as markup.
If that's true, seems like it's still a security risk given what you can do with CSS these days: https://news.ycombinator.com/item?id=47132102
Or I guess you could completely restyle and change the text of UI elements so it looks like the user is doing one thing when they're actually doing something completely different like sending you money
.setHTML("<h1>Hello</h1>", new Sanitizer({}))
will strip all elements out. That's not too difficult.
Plus this is defense-in-depth. Backends will still need to sanitize usernames on some standard anyhow (there's not a lot of systems out there that should take arbitrary Unicode input as usernames), and backends SHOULD (in the RFC sense [1]) still HTML-escape anything they output that they don't want to be raw HTML.
new Sanitizer({})
This Sanitizer will allow everything by default, but setHTML will still block elements/attributes that can lead to XSS.
You might want something like:
new Sanitizer({ replaceWithChildrenElements: ["h1"], elements: [], attributes: [] })
This will replace <h1> elements with their children (i.e. text in this case), but disallow all other elements and attributes.
Your lack of imagination is disturbing :-)
How exactly, given that setHTML sanitizes the input? If you don't want to have any HTML tags allowed, seems you can configure that already? https://wicg.github.io/sanitizer-api/#built-in-safe-default-...
The article says that the output is:
<h1>Hello my name is</h1>
So it keeps (non-script) html tags (and presumably also attributes) in the input. Idk how you're asking "how" since it's the default behavior.
Stripping HTML tags completely has always been possible with the drop-in replacement `textContent`. Making a custom configuration object for that is much more roundabout.
I can see how it's a way of allowing some tags like bold and italic without needing a library or some custom parser, but I didn't understand what the point of this default could be and so why it exists (a sibling comment proposed a plausible answer: hardening on top of another solution)
> Yes, because that's the default configuration, if you don't want that, stop using the default configuration?
"don't use it if it's not what you want" is perhaps the silliest possible answer to the question "what's the use-case for this"
Maybe you meant .innerHTML? .innerText AFAIK doesn't try to parse HTML (why would it?), but I don't understand what you mean with nonstandard, both .innerHTML and .innerText are part of the standards, and I think they've been for a long time.
> but I didn't understand what the point of this default could be and so why it exists (a sibling comment proposed a plausible answer: hardening on top of another solution) [...] the question "what's the use-case for this"
I guess maybe third time could be the charm: it's for preventing XSS holes that are very common when people use .innerHTML
That information is in the question, so sadly no this still doesn't make sense to me because I don't understand any scenario in which this is what the developer wants. You always still need more code (to filter the right tags) or can just use textContent (separating data and code completely, imo the recommended solution)
> Maybe you meant .innerHTML? .innerText AFAIK doesn't try to parse HTML (why would it?)
No, I didn't mean that, yes it does, and no I don't know why it is this way. If you don't believe me and don't want to check it out for yourself, I'm not sure what more I can say
Client-side includes.
The default might be suitable for something like an internal blog where you want to allow people to sometimes go crazy with `<style>` tags etc, just not inject scripts, but I would expect it to almost always make sense to define a specific allowed tag and attribute list, as is usually done with the userland predecessors to this API.
Anyone who wants to provide some level of flexibility but within bounds. Say, you want to allow <strong> and <em> in a forum post but not <script>. It's not too difficult to imagine uses.
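A sketch of that forum-post case, assuming `setHTML` accepts an options object whose `sanitizer` member can be a configuration dictionary with `elements` and `attributes` allowlists (my reading of the current draft and MDN, worth re-checking). The `renderPost` helper is hypothetical, and it falls back to plain text where `setHTML` is not available.

```javascript
// Hypothetical helper: render a forum post, allowing only <strong> and <em>.
function renderPost(container, untrustedHtml) {
  if (typeof container.setHTML === "function") {
    container.setHTML(untrustedHtml, {
      // Allowlist: only these elements, and no attributes at all.
      sanitizer: { elements: ["strong", "em"], attributes: [] },
    });
  } else {
    // No Sanitizer API available: show the markup as literal text instead.
    container.textContent = untrustedHtml;
  }
}

// Stand-in object for illustration; in a page this would be a real element.
const legacyContainer = {};
renderPost(legacyContainer, "<em>hi</em><script>alert(1)</script>");
console.log(legacyContainer.textContent);
```

Falling back to `textContent` rather than `innerHTML` means an unsupported browser shows raw tags to the reader, which is ugly but does not execute anything.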
With a safe API like this one that's tied to the browser's own interpretation of HTML (i.e. it is perfectly placed to know exactly what is and isn't dangerous given it is the one rendering it) wouldn't it be much better to rely on that?
Are we taking out all the fun of the web? I absolutely loved the <marquee> names people had in the early days of Facebook, it was all harmless fun.
If injection of frontend code takes down your backend, your backend sucks, fix it.
Iframes have significant restrictions as they can’t flow with the DOM. With AI and the increase in dynamic content, there’s going to be even more situations where you run untrusted code. I want configurable encapsulation.
This really just seems like another attempt at reinventing the wheel. Somewhat related, I find it ironic how I cannot browse hacks.mozilla.org in my old version of Firefox ("Browser not supported"). Also, developer.mozilla.org loads mangled to various degrees in current versions of Pale Moon, Basilisk, and SeaMonkey.
It's like there is some sort of "browser cartel" trying to screw up The Web.
This is like saying C is memory safe as long as your code doesn't have any bugs.
More saliently, it does not consider parser differentials.
Don't get me wrong, better than nothing, but also really really consider just using "setText" instead and never allow the user to add any sort of HTML to the document.
What about when the author of the page wants to add large html fragments to the page?
Are you saying that you cannot think of a single use for this, considering how often innerHTML is being used?
I don’t ever use it with user input, but use it often when building SPA without frameworks
.set_html()
Makes objectively more sense than:
.inner_html()
.inner_html =
.set_inner_html()
It is a fairly small thing, but ... really. One day someone should clean up the mess that is JavaScript. Guess it will never happen, but JavaScript has so many strange things ...
I understand that this here is about protection against attacks rather than a better API design, but really - APIs should ideally be as great as possible the moment they are introduced and shown to the public.
The DOM API has always felt like, and still does, it was written by people that have never made an API.
So many issues in the client JS world originate from insufficient or bad browser APIs.
SQL("select * from user where name = " + name);
Kids in the '20s:
div.innerHTML = "Hello " + user.name;
"Summarize this email: " + email.contents
Prompt injection is just the same problem on a new technology. We didn't learn anything from the 90s.
delete Element.prototype.innerHTML;
Then assignments to innerHTML do not modify the element's textContent or child node list and assignments to it will not throw an error.
This new method they've cooked up would be called eval(code,options) if html was anything other than a markup language
https://stackoverflow.com/questions/78516750/parametrize-tab...
It would close the loop better if you could also use policy to switch off innerHTML in a given page, but definitely a step in the right direction for plain-JavaScript applications.