With context, this article is more interesting than the title might imply.
> The Sanitizer API is a proposed new browser API to bring a safe and easy-to-use capability to sanitize HTML into the web platform [and] is currently being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
This would replace the need to sanitize user-entered content with libraries like DOMPurify by building that capability into the browser itself. The proposed specification has additional information: https://github.com/WICG/sanitizer-api/
Yeah, I was expecting something closer to "because that's what people Google for".
A big part of designing a security-related API is making it really easy and obvious to do the secure thing, and hide the insecure stuff behind a giant "here be dragons" sign. You want people to accidentally do the right thing, so you call your secure and insecure functions "setHTML" and "setUnsafeHTML" instead of "setSanitizedHTML" and "setHTML".
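A rough sketch of how that plays out with this API (using the setHTML() / setHTMLUnsafe() names the proposal has been converging on; exact names and options may still shift while it's being incubated):

```js
const comment = document.querySelector('#comment'); // hypothetical target element
const untrusted = '<p>hi</p><img src=x onerror=alert(1)>';

// The short, obvious call is the safe one: the markup goes through the
// built-in sanitizer, so the onerror handler never makes it into the DOM.
comment.setHTML(untrusted);

// The dangerous call is the one with the scary name, so it stands out in review:
// comment.setHTMLUnsafe(untrusted);
```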
mysql_real_escape_string is only deprecated because there is mysqli_real_escape_string. I always wondered why it's "real"... like, is there a "fake" version of it?
The author really needs to start with that. They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom. I just assumed it's some open source library.
> HTML parsing is not stable and a line of HTML being parsed and serialized and parsed again may turn into something rather different
Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous? Because it's pretty much the only thing you can do when sanitizing server-side, which we do a lot.
Edit: I also wonder how one would add, for example, rel="nofollow noreferrer" to links using this. Some sanitizers have a "post-process node" visitor function for this purpose (they already have to traverse the DOM tree anyway).
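One workaround I can picture (just a sketch, assuming the proposed setHTML() lands roughly as described and has no visitor hook) is to post-process the live DOM after sanitizing, since at that point you're walking real nodes rather than a string:

```js
const untrustedHtml = '<a href="https://example.com">a link</a>'; // user-supplied markup
const container = document.querySelector('#user-content'); // hypothetical target element

container.setHTML(untrustedHtml); // proposed API: sanitize, then insert

// Post-process the sanitized tree: tag every link.
for (const a of container.querySelectorAll('a[href]')) {
  a.setAttribute('rel', 'nofollow noreferrer');
}
```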
> Are there any examples where the first approach (sanitize to string and set inner html) is actually dangerous?
The article links to [0], which has some examples of instances in which HTML parsing is context-sensitive. The exact same string being put into a <div> might be totally fine, while putting it inside a <style> results in XSS.
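To make that concrete, here's a minimal sketch of the serialize-and-reparse hazard (the payload is illustrative, not one of the article's examples): a tree that is harmless as a DOM becomes dangerous once it round-trips through a string.

```js
// As a DOM, this is harmless: the <style> element merely contains text
// (invalid CSS), and there is no <img> element anywhere in the tree.
const style = document.createElement('style');
style.textContent = '</style><img src=x onerror=alert(1)>';

// Serialization writes style contents out as raw text, so the markup
// reappears literally in the string:
const html = style.outerHTML;
// '<style></style><img src=x onerror=alert(1)></style>'

// Re-parsing that string closes the <style> early, so the <img> becomes
// a real element and its onerror handler fires.
const div = document.createElement('div');
div.innerHTML = html;
```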
> They say "the API that we are building" and assume I know who they are and what they're working on, all the way until the very bottom.
This is a common and rather tiresome critique of all kinds of blog posts. I think it is fair to assume the reader has a bit of contextual awareness when you publish on your personal blog. Yes, you were linked to it from a place without that context, but it’s readily available on the page, not a secret.
Well that's... certainly a take. But I have to disagree. Most traffic coming to blog posts is not from people who know you and are personally following your posts; it's from people who clicked a link someone shared or found the article while googling something.
It's not hard to add one line of context so readers aren't lost. Here, take this for example, combining a couple parts of the GitHub readme:
> For those who are unfamiliar, the Sanitizer API is a proposed new browser API being incubated in the Sanitizer API WICG, with the goal of bringing this to the WHATWG.
Easy. Can fit that in right after "this blog post will explain why", and now everyone is on the same page.
> Most traffic coming to blog posts is not from people who know you and are personally following your posts
Do we have data to back that up? Anecdotally the blogs I have operated over the years tend to mostly sustain on repeat traffic from followers (with occasional bursts of external traffic if something trends on social media)
Here's my anecdotal data. Number of blogs that I personally follow: zero. And yet, somehow, I end up reading a lot of blog posts (mostly linked from HN, but also from other places in my webosphere).
(More than a bit irritated by the "Do you have data to back that up" thing, given that you don't really have data to back up your position).
> (More than a bit irritated by the "Do you have data to back that up" thing, given that you don't really have data to back up your position).
It wasn't necessarily a request for you personally to provide data. I'm curious if any larger blog operators have insight here.
"person who only reads the 0.001% of blog posts that reach the HN front page" is not terribly interesting as an anecdotal source on blog traffic patterns
What's hard in this case is that you end up making it 80% of the way through the article before you start to wonder what the heck this guy is talking about. So you have to click away to another page to figure out who the heck this guy is, then start again at the top of the article, reading it with that context in mind.
One word would have fixed the problem. "Why does the Mozilla API blah blah blah.". Perhaps "The Mozilla implementation used to...". Something like that.
They had a link in their post [0]: it seems like most of the examples involve HTML elements with wacky contextual parsing semantics, such as <svg> or <noscript>. Their recommendation for server-side sanitization is "don't, lol", and they don't offer much advice beyond that.
Personally, my recommendation in most cases would be "maintain a strict list of common elements/attributes to allow in the serialized form, and don't put anything weird in that list: if a serialize-parse roundtrip has the remote possibility of breaking something, then you're allowing too much". Also, "if you want to mutate something, then do it in the object tree, not in the serialized version".
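In DOMPurify terms (the library mentioned upthread), that looks roughly like the sketch below; the allowlist is illustrative, not a claim that it's the right set for any particular app, and on the server DOMPurify needs to be handed a DOM implementation such as jsdom:

```js
import { JSDOM } from 'jsdom';
import createDOMPurify from 'dompurify';

const DOMPurify = createDOMPurify(new JSDOM('').window);
const untrustedHtml = '<p>hello <img src=x onerror=alert(1)></p>'; // user-supplied markup

const clean = DOMPurify.sanitize(untrustedHtml, {
  // Allow only boring, well-understood elements and attributes.
  ALLOWED_TAGS: ['p', 'a', 'em', 'strong', 'ul', 'ol', 'li', 'blockquote', 'code', 'pre'],
  ALLOWED_ATTR: ['href'],
});
```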
setHTML needs to support just about every element if it's going to be the standard way of rendering dynamic content. Certainly <svg> has to work or the API isn't useful.
SanitizeHTML functions in JS have had big security holes before, around edge cases like null bytes in values, or what counts as a space in Unicode. Browsers decided to be lenient in what they accept, so that means any serialize-parse chain creates some risk.
If you're rendering dynamic HTML, then either the source is authorized to insert arbitrary dynamic content onto the domain, or it isn't. And if it isn't, then you'll always have a hard time unless you're as strict as possible with your sanitization, given how many nonlocal effects can be embedded into an HTML snippet.
The more you allow, the less you know about what might happen. E.g., <svg> styling can very easily create clickjacking attacks. (If I wanted to allow SVGs at all, I'd consider shunting them into <img> tags with data URLs.) So anyone who does want to use these more 'advanced' features in the first place had better know what they're doing.
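The <img> shunt I have in mind is roughly this (a sketch; it deliberately gives up scripting, external loads, and interactivity inside the SVG, which is the point):

```js
// Render untrusted SVG markup in an image context rather than inlining it:
// inside <img>, scripts don't run and external resources aren't fetched.
function svgAsImage(svgMarkup) {
  const img = document.createElement('img');
  img.src = 'data:image/svg+xml,' + encodeURIComponent(svgMarkup);
  return img;
}

document.body.appendChild(
  svgAsImage('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"><circle cx="50" cy="50" r="40"/></svg>')
);
```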
That overly reductive thinking can go back to the 80s before we had learned any lessons. There are degrees of trust. Binary thinking invites dramatic all or nothing failures.
And my point is that with HTML, there's always an extremely fine line between allowing "almost nothing" and "almost all of it" when it comes to sanitization. I'd love to live in a world where there are natural delineations of features that can safely be flipped on or off depending on how much control you want to give the source over the content, but in practice, there are dozens of HTML/CSS features (including everything in the linked article) that do wacky stuff that can cross over the lines.
> Because it's pretty much the only thing you can do when sanitizing server-side
I'd suggest not sanitizing user-provided HTML on the server. It's totally fine to do if you're fully sanitizing it, but it gets a little sketchy when you want to keep certain elements and attributes.
I've been using Choosy.app to easily manage different browsers for work and personal use (and testing), and it works great. You set it as your default browser, and then any time something opens a browser it pops up a picker. It has lots of global and per-site configuration options, like browser profile selection, private windows, etc.
Would you say more about your experience writing it in Rust? What worked well, what didn't, and anywhere you found that you struggled unexpectedly or that things were easier than you expected?
Hey, thanks for asking. I'm the furthest from an authority in this so I encourage you to take everything I say with a grain of salt.
I was using the burn[0] crate which is pretty new but in active development and chock-full of features already. It comes with a lot of what you need out of the box including a TUI visualizer for the training and validation steps.
The fact that it's so full of features is a blessing and a curse. The code is very modular, so you can use the pieces you want the way you want to use them, which is good, but the "flavor" of Rust in which it is written felt like a burden compared to the way I'm used to writing Rust (which, for context, is 99% using the glorious iced[1] GUI library). I can't fault burn entirely for this; after all, they are free to make their own design choices, and I was a beginner trying to do this in less than a week. I also think they are trying to let a practitioner get up and running right away, whereas I was trying to build a modular configuration on top of the crate instead of a one-and-done script.
But there were countless generic types, several traits to define and implement just to make some generic parameter fit its bounds, and more proc_macro derives than I'd like (my target number is 0), such as `#[derive(Module, Config, new)]`; they obfuscate the code that I actually have to write and don't teach me anything.
TL;DR the crate felt super powerful but also very foreign. It didn't quite click to the point where I thought it was intuitive or I felt very fluent with it. But then again, I spent like 5 days with it.
One other minor annoying thing was that I couldn't download exactly what I wanted out of HuggingFace directly. I ended up having to use `HuggingfaceDatasetLoader::new("carlosejimenez/wikitext__wikitext-2-raw-v1")` instead of `HuggingfaceDatasetLoader::new("Salesforce/wikitext")` because the latter would get an auth error, but this may also be my ignorance about how HF is supposed to work...
Eventually, I got the whole thing to work quite neatly and was able to tweak hyperparameters and get my model to increasingly better perplexity. With more tweaks, a better tokenizer, possibly a better data source, and an NVIDIA GPU rather than Apple Silicon, I could have squeezed even more out of it. My original goal was to try to slap an iced GUI on the project so that I could tweak the hyperparameters there, compare models, plot the training and inference, etc. with a GUI instead of code. Sort of a no-code approach to training models. I think it's an area worth exploring more, but I have a main quest I need to finish first, so I just wrote down my findings in an unpublished "paper" and tabled it for now.
Even in the flood of terrible news about privacy and other things, this exposé stands out as especially disturbing. I was considering getting a new electric car to replace my combustion, but now I'm going to stretch it for as long as I can instead.
pry is what I miss most when using other languages. I've used all kinds of debuggers on all kinds of hardware with many different languages, and pry is by far the best tool for development and debugging. People talk about the REPL in Lisp for good reason, but pry takes that concept to infinity and beyond. When I think about the future of AI-assisted programming, it's something much more like the pry interactive development loop than a code editor's suggestions.
Reminds me of the story of the kids in an Ethiopian village who were given tablets by One Laptop Per Child. The kids figured out how to turn them on within minutes; within five days they were using 47 apps per child; within two weeks they were singing the English alphabet; and within five months they had hacked Android. https://www.theregister.com/2012/11/01/kids_learn_hacking_an...
You do a great job explaining these concepts, better than most. I have appreciated all of your replies in this post. Do you have a blog or podcast or teach somewhere? I would tune in.
So the first thing is the meta-learning. Companies like Valve and Wolfram provide a template for another way of running a company, one which seems to consistently produce both the best kind of software and incredible wealth for everyone involved: the two things you look for when running a software company.
Next, Stephen livestreams his day-to-day as a CEO. This is so significant. I know the HN trope that dang warned about earlier, but I actually love it. Imagine if you could get detailed logs of how Steve Jobs lived his life. Not from books others write about him, padded with made-up stuff to sell more copies, but straight from the horse's mouth, as they say. That is what his meticulous logs and streams of his life provide.
Gabe Newell of course does much less of this, but he still has some incredible videos that go into real depth on how he runs the business and what he thinks about.
Look, we are nerds. To learn business, we go online and try to piece together information. For example, I know for a fact a bunch of YC companies (both in this batch and earlier) have fallen for scammers like Alex Hormozi because he has a massive Youtube presence and just spews nonsense which sounds like it should make sense.
So in that world, getting to learn as close to first-hand as possible from people who actually run some of the biggest and most interesting businesses on the planet is just incredible.
Has anyone found or made a great set of tutorials for "Affinity for Photoshop Experts"? I've been using Photoshop for more than 30 years (now Photopea), and I don't think I've ever felt more like an alien than the two times I've tried in earnest to learn the Affinity tools. A six-month trial could be generous enough for me to assimilate.
Thanks for this link. Firefox has been getting worse for me stability-wise on my M1 Mac: even with tab discarding it consumes huge amounts of power, and at least two or three times a day it will just stop loading webpages, show errors in the network tab, and need to be restarted. I spend a couple of hours every few weeks trying to track down the issues in Firefox, and even in the bug tracker I can't find answers.
I also have a bizarre problem where any Chromium-based browser (Chrome, Brave, Edge) is extremely slow to load any page since I upgraded to Sonoma, while Firefox and Safari are near-instant: it can take 60 seconds to even start the DNS lookup, and only after a couple of minutes will a page eventually fully load. I've seen other people mention the same issue online, but no fixes. I have spent hours trying to debug and track down that problem too.
It's discouraging how much it feels like every software tool I use on every device has gone to shit, especially things as fundamental as a web browser.