
[-]

WASM has strong tried and proven sandboxing. We basically can build on nearly 30 years of experience. The decoders don't need a lot of access, they can basically be pure functions.

If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.

by ok1234561 hours ago|

[-]

How do you prevent compression bomb attacks when files can define their own compression functions?

You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.

This pretty much kills any ingestion pipeline where the source is untrusted.
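One alternative to an OOM killer is to cap the decoded size up front, so a bomb fails with a clear error instead of exhausting memory. A minimal sketch, using Python's standard `zlib` module as a stand-in for an arbitrary decoder; the name `bounded_decompress` and the limits are illustrative assumptions, not part of F3.

```python
import zlib


def bounded_decompress(data: bytes, max_output: int) -> bytes:
    """Inflate zlib data, refusing to produce more than max_output bytes."""
    d = zlib.decompressobj()
    # max_length caps how many bytes this call may return. Asking for one
    # byte more than the limit lets us detect an overrun without ever
    # inflating the whole payload.
    out = d.decompress(data, max_output + 1)
    if len(out) > max_output:
        raise ValueError(f"decoded output exceeds {max_output} bytes")
    return out


# 10 MB of zeros compresses to a tiny fraction of that size.
bomb = zlib.compress(b"\0" * 10_000_000)
```

Calling `bounded_decompress(bomb, 1_000_000)` raises `ValueError` after inflating at most one byte past the limit. Picking the right limit for "big data" workloads is the hard part, as the comment above points out.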

by computomatic1 hours ago|

[-]

It seems like the WASM is simply a fallback if no other decoder is available. If the data source is untrusted, simply don’t run the WASM decoders.

“Some code is untrusted” does not mean code should never be executed. There are more use cases with trusted sources than untrusted.

by ok1234561 hours ago|

[-]

So I define the data type to be "asdklfjaslkdfjiolsadfjoiusadfoiasfoikasjfdoisadf" and give you a decoder for it.

by johncolanduoni1 hours ago|

[-]

OOM killing in WebAssembly is trivial, since it’s all in a growable linear memory. All the runtimes I’m aware of have a simple maximum memory setting, and they’ll trap any allocation requests after that point.
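To make the mechanism concrete, here is a toy model of a growable linear memory with a hard page cap. It is not the API of any real runtime; the class `LinearMemory` and its method names are hypothetical. In core Wasm a failed `memory.grow` reports -1 to the guest, and what happens next (usually an abort) is up to the guest's allocator.

```python
PAGE_SIZE = 64 * 1024  # Wasm linear memory grows in 64 KiB pages


class LinearMemory:
    """Toy model of a Wasm linear memory with a host-imposed maximum."""

    def __init__(self, initial_pages: int, max_pages: int) -> None:
        self.max_pages = max_pages
        self.data = bytearray(initial_pages * PAGE_SIZE)

    @property
    def pages(self) -> int:
        return len(self.data) // PAGE_SIZE

    def grow(self, delta_pages: int) -> int:
        """Mimic memory.grow: return the old page count, or -1 if over the cap."""
        old = self.pages
        if old + delta_pages > self.max_pages:
            return -1  # request refused; memory is left unchanged
        self.data.extend(bytes(delta_pages * PAGE_SIZE))
        return old


mem = LinearMemory(initial_pages=1, max_pages=4)
```

With this cap, `mem.grow(2)` succeeds and a further `mem.grow(2)` is refused, because it would take the memory to 5 pages.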

by blmarket1 hours ago|

[-]

The attack surface is not just the file format itself. Judging by the function signature, a single decoder can generate an infinite byte stream, which creates a lot of headaches for reader implementations: implementing STRLEN is no longer a trivial question.

Either engines should impose a limit (e.g. VARCHAR(2000) caps the length at 2000, though some engines support unlimited BLOBs), or the decoder should give a hint about the maximum length it will yield. Unfortunately, the current research-level project does not implement such considerations yet...
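The engine-side limit could be sketched as a wrapper that stops consuming a decoder's output once a byte budget is spent. This assumes the decoder is exposed as an iterator of byte chunks, which is an assumption for illustration; `limit_output`, `OutputLimitExceeded` and `endless_decoder` are hypothetical names.

```python
from typing import Iterable, Iterator


class OutputLimitExceeded(Exception):
    """Raised when a decoder yields more bytes than the reader allows."""


def limit_output(chunks: Iterable[bytes], max_bytes: int) -> Iterator[bytes]:
    """Pass chunks through until the running total would exceed max_bytes."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise OutputLimitExceeded(f"decoder produced more than {max_bytes} bytes")
        yield chunk


def endless_decoder() -> Iterator[bytes]:
    """A hostile decoder that never stops producing output."""
    while True:
        yield b"A" * 1024
```

Because the wrapper is lazy, iterating `limit_output(endless_decoder(), 4096)` yields four chunks and then raises, rather than hanging.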

by ok1234561 hours ago|

[-]

For images, it makes sense: people dealing with 16k x 16k PNGs are uncommon. Give them an error message that tells them the setting to bump. But what should be the threshold for "big data"? I'm sure it will follow Zipf's Law, but the tail will be fatter.

[-]

And many of them have built-in gas metering, so you can time out the decode if it runs too many instructions.

by kibwen1 hours ago|

[-]

Denial-of-service is bad, but it's not in the same ballpark, the same sport, the same planet, or the same universe of bad as RCE.

by Retr0id1 hours ago|

[-]

WASM implementations are fairly mature now, but if there was e.g. an image file format with embedded WASM that needed to execute before you could view it, it would become the new low-hanging-fruit target for 0-click RCEs - whether it's exploiting the WASM engine itself or some other attack surface that's influenceable via it (See also, the FORCEDENTRY JBIG2 exploit).

[-]

That exploit targeted an integer overflow in a bespoke Apple sandboxing mechanism. Bespoke sandboxing mechanisms have weird bugs.

Not that Wasm engines don't have bugs, but the whole point is to have an extremely solid, well-specified and efficient implementation of a widely accepted bytecode format. We can scope down the capabilities given to any program to a minimal set.

by Retr0id1 hours ago|

[-]

Bugs are near-inevitable, and mitigations are the last line of defence. Scripting engines are excellent for bypassing mitigations (iiuc in the case of the FORCEDENTRY exploit, it was used for adjusting ASLR'd offsets).

As a random example that's an area of personal interest to me, I know of 3 distinct methods of achieving userland ROP execution on the Nintendo Switch 2, and all three rely on the (ab)use of a scripting engine (even if they aren't a vulnerability in the scripting engine itself).

[-]

Well don't accept code from anyone ever then.

But seriously, if your format requires extensibility to the point that it embeds a bytecode, especially a Turing-complete bytecode, what format are you going to choose? Just design a new one? That's how you end up with a scripting engine with three ROP exploits.

by Kiboneu1 hours ago|

[-]

> WASM has strong tried and proven sandboxing. We basically can build on nearly 30 years of experience. The decoders don't need a lot of access, they can basically be pure functions.

I've heard that kind of sentiment many times before. It's not a good (thought-terminating) mindset to have for any secure software.

There are several WASM implementations, WASM is just a format. "Pure functions" are pure at a superficial level. Many people say that they don't mutate global state, but they do ... it's just hidden. The decoders "not needing a lot of access" doesn't matter if the WASM engine is pwned through arbitrary code execution inside the environment, or if it's contorted to bypass the access control you are mentioning through various side-effects.

by bilekas1 hours ago|

[-]

> The decoders don't need a lot of access, they can basically be pure functions

They don't currently either, do they? It's the tight coupling of the interface layer, no? I'm not sure this would be faster or more secure, so reliability might be the best use case?

by arcfour2 hours ago|

[-]

Yes...my first thought. No way in hell anyone actually trusts this.

(And as if we didn't trust the compiler enough already!)

by Omega3591 hours ago|

[-]

Meh, it's not that bad. Pretty simple to block inline wasm and to use well known external decoders.

by nine_k2 hours ago|

[-]

Does WASM have built-in I/O? If not, all that a decoder would be able to do is to decode into a buffer.

by 0x4571 hours ago|

[-]

All WASM can do is transfer a bag of bytes between the module runtime and the host. So yes, it can just decode into a buffer. Even if you use wasm components to give it I/O, you can still make these go to a buffer.

by doctorpangloss2 hours ago|

[-]

But the WASM runs in the sandbox! It only has access to some files, your display, inputs, ... nothing insecure at all!

by gavinray1 hours ago|

[-]

WASM runs in a confined memory space allocated for the program. There is no I/O or host address space access.

You need to run a WASI environment for that.

by rebeccajae1 hours ago|

[-]

It sounds neat, but feels like it might fall apart with higher-complexity formats. What does an embedded decoder for a PDF look like? I guess since they are tightly-coupled to the file bytes themselves, the author of the file gets to choose what formats make sense, but not all formats have a one-true-decode-step.

by aseipp1 hours ago|

[-]

Despite the name seemingly implying otherwise, F3 is an alternative to columnar storage formats like Parquet; the goal is not to support every conceivable encoding of every file type such as a PDF. Think of the use cases being more like "What if you used a specialized compressor and need a custom block decompression algorithm" or "Decode internal format into Arrow output" or something like that.

by mort961 hours ago|

[-]

I don't understand how that's supposed to work. What does the decoder decode into? That's gonna depend entirely on the kind of data, right? For some formats, it's gonna be a stream of bytes; for others, a 2D plane of pixels; others again will need vertexes, 2D planes of pixels and UV maps; for some, an object graph will make more sense.

by gavinray54 minutes ago|

https://github.com/future-file-format/F3/blob/bd92506447dc13...

[-]

It appears as though the WASM decode returns two values: one indicating the data type as a primitive value, and a second value being the data buffer.

Then there is a helper in this case to deserialize, "primitive_array_from_buffers()".
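The shape of that interface can be sketched as a type tag plus a raw buffer, turned into a typed array by a helper. This is not F3's actual code: the tag values, the little-endian layout and the name `array_from_buffer` are assumptions for illustration only.

```python
import struct

# Hypothetical type tags mapped to (struct format code, width in bytes).
# The real F3 tags are not shown in the thread.
FORMATS = {
    0: ("i", 4),  # int32
    1: ("q", 8),  # int64
    2: ("d", 8),  # float64
}


def array_from_buffer(type_tag: int, buf: bytes) -> list:
    """Interpret buf as a little-endian array of the tagged primitive type."""
    code, width = FORMATS[type_tag]
    if len(buf) % width != 0:
        raise ValueError("buffer length is not a multiple of the element width")
    count = len(buf) // width
    return list(struct.unpack(f"<{count}{code}", buf))


# What a decoder might hand back: a tag and the decoded bytes.
tag, data = 0, struct.pack("<3i", 1, 2, 3)
values = array_from_buffer(tag, data)
```

A fixed set of primitive output types like this would also answer the question above about what the decoder decodes into, at least for columnar data.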

by cbm-vic-201 hours ago|

[-]

Applets redux.

by grodes2 hours ago|

[-]

How is wasm better than C bindings?

[-]

Many languages don't have ergonomic experiences for working with C ABIs without explicit wrapper code.

Hell, Node.js didn't even get this ability until LAST MONTH:

https://nodejs.org/en/blog/release/v26.1.0

You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.

by bluejekyll2 hours ago|

[-]

WASM is platform independent.

What do you mean by C bindings? C bindings to what?

by grodes1 hours ago|

[-]

C bindings to a C implementation

by yung_lean54 minutes ago|

[-]

This isn't using WASM to solve the "how can I make my file format compatible with more programming languages?" problem. This is trying to solve the "how can I add new encodings to my file format without making everyone update their code?" problem. The former would rightly be solved with C bindings that anyone can link with if they want. The latter might not seem like a big deal, but it's been the main blocker advancing the parquet format. Most people end up not caring about new advanced encodings and just write parquet files with the most compatible feature set.

by coldtea59 minutes ago|

[-]

C bindings are not platform independent, nor do they come with a runtime and a sandbox, among other things. Apples to oranges.

by andrewstuart22 hours ago|

[-]

I would call it clever. I'm not sure I'd call it genius.

When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.

by yung_lean1 hours ago|

[-]

A big problem with parquet, which this aims to replace, is that it's hard to add new encodings because everyone wants to stay compatible with old readers. Embedding the decoders in the file as WASM solves this problem since in theory, old readers will be able to read new files by just using the provided WASM to decode a column whose format the reader doesn't recognize.

So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
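The fallback described here, combined with the "don't run embedded decoders for untrusted sources" policy raised elsewhere in the thread, could look roughly like this. It is a sketch under stated assumptions: `NATIVE_DECODERS`, `decode_column` and the `allow_embedded` flag are hypothetical, and a plain Python callable stands in for what would really be a sandboxed Wasm call.

```python
from typing import Callable, Dict, Optional

Decoder = Callable[[bytes], bytes]

# Encodings this (hypothetical) reader was compiled with.
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain": lambda payload: payload,
}


def decode_column(
    encoding: str,
    payload: bytes,
    embedded: Optional[Decoder] = None,
    allow_embedded: bool = False,
) -> bytes:
    """Prefer a native decoder; use the file's embedded one only if permitted."""
    native = NATIVE_DECODERS.get(encoding)
    if native is not None:
        return native(payload)
    if embedded is not None and allow_embedded:
        # In a real reader this would be a call into a sandboxed Wasm module.
        return embedded(payload)
    raise LookupError(f"no usable decoder for encoding {encoding!r}")
```

An old reader that has never heard of an encoding can still read the column when `allow_embedded=True`, while a pipeline ingesting untrusted files leaves the flag off and gets a `LookupError` instead of running foreign code.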

by coldtea56 minutes ago|

[-]

>no individual author really needs a language-agnostic way of accessing data beyond compile time.

That's so untrue! People need language-agnostic ways to access data all the time, and people work with data accessing them from multiple languages all the time!

If I have parquet files I can load them in duckdb, in pandas and polars, process them with various independent tools, and loads of other things... and people do that.

This is also why people like something like an SQL database, your data is not locked to some specific language / lib for access.

by verdverm2 hours ago|

[-]

Is embedding executable code into a file a security risk? My assumption is yes.

by mirashii2 hours ago|

[-]

That would be why it chose a VM that is explicitly designed for sandboxing rather than native executable code or similar; the risk can be minimized by reducing the surface area available to that executable code to almost nothing.

by msla2 hours ago|

[-]

> Is embedding executable code into a file a security risk?

Yes, which is why nobody uses PDFs.

by NooneAtAll358 minutes ago|

[-]

which is why no sane pdf viewer implements executable features*

by bguebert49 minutes ago|

[-]

I mean I disable javascript embedded in pdf and feel like it would have been better to not have that feature. It would spare people from the invoice.pdf email attachment viruses because most people had assumed pdf isn't going to be as bad as an exe.

by nine_k1 hours ago|

[-]

TrueType and OpenType fonts include code executed by a VM to even render them. This wasn't a viable source of attacks so far, due to the properly limited nature of the VMs.

Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanics.

by cmiles741 hours ago|

https://learn.microsoft.com/en-us/security-updates/SecurityB...

[-]

> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.

> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.

> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.

by tedd4u1 hours ago|

[-]

There are many documented, exploited-in-the-wild font-file attacks (one example in [1]). Apple is rewriting their font interpreter specifically to improve security [2].

[1] https://www.bleepingcomputer.com/news/security/facebook-disc...

[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...

[-]

There is no concept of "executable" vs "non-executable" content in a file.

A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.

What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.

by jedberg2 hours ago|

[-]

Sure, but when the standard says "read this file and execute the instructions you find at the beginning" that is more dangerous than "this is a file with data and your program needs to figure out how to read it".

[-]

I guess it's a good thing that the F3 standard does not say "read this file and execute the instructions you find at the beginning", then?

The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.

by jedberg1 hours ago|

[-]

Ok if you want to be pedantic, the standard says, "if you can't read this file, go to the offset and then execute the code you find" which isn't functionally different from what I said.

by ratorx2 hours ago|

[-]

There’s a big difference in the expected use of a file. If the file is attacker provided, and the fallback path is being used, the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.

Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. Parser might have bugs, but it avoids a whole class of issues.

by gavinray1 hours ago|

[-]

> the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.

And then do what with it?

WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.

by ratorx1 hours ago|

[-]

Putting aside the WASM sandboxing (I’m not familiar enough with it to understand how sandboxing works) there’s a DoS vector at least. Even regexes have had many DoS issues, and I can’t imagine WASM being easier to sandbox for DoS risk.

by 73737373731 hours ago|

[-]

There exist Wasm interpreters capable of limiting the number of instructions executed.

[-]

Many can, even if they have JITs, e.g. Wasmtime. Failing that, it's not that hard to add bytecode instrumentation that will count instructions and terminate early. Some execution platforms that utilize Wasm just inject bytecode instrumentation into guest programs before sending them to the Wasm engine. It's relatively easy to do and not that much overhead.
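Instruction counting is easiest to see in a toy interpreter. The stack machine below is not Wasm and not any real engine's API; `run`, `OutOfFuel` and the three opcodes are made up for illustration. It charges one unit of fuel per instruction and aborts when the budget is spent.

```python
class OutOfFuel(Exception):
    """Raised when a program exceeds its instruction budget."""


def run(program, fuel):
    """Execute a list of (op, arg) pairs, charging one fuel unit per instruction."""
    stack = []
    pc = 0
    while pc < len(program):
        if fuel <= 0:
            raise OutOfFuel("instruction budget exhausted")
        fuel -= 1
        op, arg = program[pc]
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "jmp":
            pc = arg  # unconditional jump; enough to build an infinite loop
            continue
        else:
            raise ValueError(f"unknown op {op!r}")
        pc += 1
    return stack


result = run([("push", 2), ("push", 3), ("add", None)], fuel=10)
```

A program consisting of `[("jmp", 0)]` loops forever without metering; with `fuel=1000` it raises `OutOfFuel` after 1000 steps. The bytecode-instrumentation approach mentioned above gets the same effect by injecting the counter into the guest program instead of the interpreter loop.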

by bguebert45 minutes ago|

[-]

I mean json might not be the best example since for a long time people would run json through a javascript engine to parse it but I can see your point.

by jastanton2 hours ago|

[-]

Gotcha, so the vulnerability will be in some common library where attackers force the wasm fallback path with custom wasm instructions that, when executed, do something nefarious.

I'd say at worst it's a setup for poor security.

by outside12342 hours ago|

[-]

I mean can't we say the same thing about sending around a .exe though?

by bluejekyll2 hours ago|

[-]

.exe has bindings to the OS ABI and system calls. WASM doesn’t have this by default; it’s up to the VM to provide whatever environment the WASM executable needs. Ideally there should be no system calls, no stdio, just instructions on how to interpret the file format.

[-]

Double-clicking an ".exe" (or running it via a shell) is not the same as "bag of bytes", it's "send these bytes to an executable environment".

Doing `head foo.exe` is quite different than `run foo.exe`

If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.

by jastanton2 hours ago|

[-]

exactly

by sieabahlpark1 hours ago|

[-]

[dead]

by vouwfietsman57 minutes ago|