If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.
You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.
This pretty much kills any ingestion pipeline where the source is untrusted.
“Some code is untrusted” does not mean code should never be executed. There are more use cases with trusted sources than untrusted.
Either engines should impose some limit (e.g. VARCHAR(2000) caps the length at 2000, though other engines support unlimited BLOBs), or the decoder should give a hint about the maximum length it will yield. Unfortunately, the current research-level project does not implement such considerations yet...
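To make the "engine puts a limit" option concrete, here is a minimal sketch in Python. It assumes the decoder hands back its output as an iterable of byte chunks; `bounded_decode`, `OutputLimitExceeded`, and `hostile_decoder` are hypothetical names for illustration, not anything from the F3 code.

```python
from typing import Iterable, Iterator


class OutputLimitExceeded(Exception):
    """Raised when a decoder yields more bytes than the caller allowed."""


def bounded_decode(chunks: Iterable[bytes], max_bytes: int) -> Iterator[bytes]:
    """Pass decoder output through, aborting once max_bytes is exceeded."""
    produced = 0
    for chunk in chunks:
        produced += len(chunk)
        if produced > max_bytes:
            # Stop before handing the oversized chunk to the caller.
            raise OutputLimitExceeded(
                f"decoder produced {produced} bytes, limit is {max_bytes}"
            )
        yield chunk


def hostile_decoder() -> Iterator[bytes]:
    """Stand-in for an untrusted decoder that never stops emitting data."""
    while True:
        yield b"\x00" * 1024
```

Wrapping a decoder as `bounded_decode(hostile_decoder(), 4096)` raises once the fifth 1 KiB chunk arrives. This only bounds what the caller accepts; it does not bound memory the decoder uses internally, which is the part an OOM killer or an engine-level memory cap would have to cover.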
Not that Wasm engines don't have bugs, but the whole point is to have an extremely solid, well-specified and efficient implementation of a widely accepted bytecode format. We can scope down the capabilities given to any program to a minimal set.
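As a rough illustration of what "scoping down capabilities" means, here is a toy stack machine in Python. To be clear, this is not Wasm and not how any real engine is implemented; the opcodes, `run`, and `SandboxError` are made up for the example. The only point is that the guest program can reach the host solely through functions the embedder explicitly passes in.

```python
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, int]  # (opcode, operand); toy format, not Wasm


class SandboxError(Exception):
    """Raised when the guest program asks for something it was not granted."""


def run(program: List[Instr], host: Dict[int, Callable[[int], int]]) -> List[int]:
    """Run a toy stack program that can reach the host only via `host`."""
    stack: List[int] = []
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "call":
            # Host functions are an explicit allow-list supplied by the embedder.
            if arg not in host:
                raise SandboxError(f"host function {arg} not granted")
            stack.append(host[arg](stack.pop()))
        else:
            raise SandboxError(f"unknown opcode {op!r}")
    return stack
```

With an empty `host` dict the program can only do arithmetic on its own stack; a `call` fails unless the embedder granted that function. Whether the interpreter itself has bugs is a separate question, which is the objection raised further down.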
As a random example that's an area of personal interest to me, I know of 3 distinct methods of achieving userland ROP execution on the Nintendo Switch 2, and all three rely on the (ab)use of a scripting engine (even if they aren't a vulnerability in the scripting engine itself).
But seriously, if your format requires extensibility to the point that it embeds a bytecode, especially a Turing-complete bytecode, what format are you going to choose? Just design a new one? That's how you end up with a scripting engine with three ROP exploits.
I've heard that kind of sentiment many times before. It's not a good (thought-terminating) mindset to have for any secure software.
There are several WASM implementations; WASM is just a format. "Pure functions" are pure at a superficial level. Many people say that they don't mutate global state, but they do ... it's just hidden. The decoders "not needing a lot of access" doesn't matter if the WASM engine is pwned through arbitrary code execution inside the environment, or if it's contorted to bypass the access control you are mentioning through various side-effects.
They don't currently either, do they? It's the tight coupling of the interface layer, no? I'm not sure this would be faster or more secure, so reliability might be the best use case?
(And as if we didn't trust the compiler enough already!)
You need to run a WASI environment for that.
Then there is a helper in this case to de-serialize, "primitive_array_from_buffers()"
https://github.com/future-file-format/F3/blob/bd92506447dc13...
Hell, Node.js didn't even get this ability until LAST MONTH:
https://nodejs.org/en/blog/release/v26.1.0
You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.
What do you mean by C bindings? C bindings to what?
When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.
So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
That's so untrue! People need language-agnostic ways to access data all the time, and people work with data accessing them from multiple languages all the time!
If I have parquet files I can load them in duckdb, in pandas and polars, process them with various independent tools, and loads of other things... and people do that.
This is also why people like something like an SQL database: your data is not locked to some specific language / lib for access.
Yes, which is why nobody uses PDFs.
Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanisms.
> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.
> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.
> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.
[1] https://www.bleepingcomputer.com/news/security/facebook-disc...
[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...
A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.
What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.
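A quick Python illustration of "each byte is whatever you make of it": the same four bytes read as text and as integers. The byte values are arbitrary and chosen only for the example.

```python
import struct

data = bytes([0x48, 0x69, 0x21, 0x0A])  # four arbitrary bytes

# Interpretation 1: ASCII text.
as_text = data.decode("ascii")

# Interpretation 2: one little-endian unsigned 32-bit integer.
as_u32_le = struct.unpack("<I", data)[0]

# Interpretation 3: two little-endian unsigned 16-bit integers.
as_u16_pair = struct.unpack("<2H", data)

print(repr(as_text), hex(as_u32_le), [hex(v) for v in as_u16_pair])
```

Nothing in the bytes says which reading is "right"; that is decided by whatever consumes them.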
The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.
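For anyone trying to picture "byte offsets in the file metadata", here is a sketch in Python of a made-up container: data, then an embedded blob, then a fixed-size footer that records where the blob lives. This is a hypothetical layout for illustration only, not F3's actual format; `FOOTER`, `build_file`, and `extract_decoder` are invented names.

```python
import struct

# Hypothetical footer: decoder offset and decoder length, both little-endian u64.
FOOTER = struct.Struct("<QQ")


def build_file(data: bytes, decoder_blob: bytes) -> bytes:
    """Lay out [data][decoder_blob][footer]; the footer points at the blob."""
    offset = len(data)
    return data + decoder_blob + FOOTER.pack(offset, len(decoder_blob))


def extract_decoder(file_bytes: bytes) -> bytes:
    """Return the embedded blob named by the footer, with bounds checks."""
    if len(file_bytes) < FOOTER.size:
        raise ValueError("file too short for footer")
    body_len = len(file_bytes) - FOOTER.size
    offset, length = FOOTER.unpack_from(file_bytes, body_len)
    if offset > body_len or length > body_len - offset:
        raise ValueError("footer points outside the file body")
    return file_bytes[offset:offset + length]
```

The reader treats those bytes as a decoder only because the metadata says so, and whoever wrote the file wrote the metadata too. That is why the bounds checks matter and why the question of what the blob is allowed to do once it runs keeps coming up in this thread.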
Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. The parser might have bugs, but it avoids a whole class of issues.
> the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.
And then do what with it? WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.
I'd say at worst it's set up for poor security.
Doing `head foo.exe` is quite different than `run foo.exe`
If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.