Yes, that seems unnecessary. The overhead of trapping and rewriting each syscall instruction once can't be (much) greater than rewriting them all up front, either.

Even if you disallow executing anything outside of the .text section, you still need the syscall trap to protect against adversarial code which hides the instruction inside an immediate value:

    foo: mov eax, 0xc3050f    ;return a perfectly harmless constant
         ret
    ...
    call foo+1
(this could be detected if the scan followed control flow instead of going linearly from the top, but what if foo+1 is reached through a function pointer?)
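To make the overlap concrete, here is a quick sketch (hand-assembled bytes, using the standard x86-64 encoding where `mov eax, imm32` is opcode B8 followed by the little-endian immediate): entering the function one byte in decodes the immediate itself as instructions.

```python
# Hand-assembled encoding of the snippet above (x86-64).
foo = bytes([
    0xB8, 0x0F, 0x05, 0xC3, 0x00,  # mov eax, 0x00c3050f
    0xC3,                          # ret
])

# A linear disassembly from foo sees only mov/ret. But a call to foo+1
# starts decoding inside the immediate:
assert foo[1:4] == bytes([0x0F, 0x05, 0xC3])  # 0F 05 = syscall, C3 = ret
print("foo+1 decodes as syscall; ret")
```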
reply
Thinking a bit more about it (and reading TFA more carefully), what's the point of rewriting the instructions anyway?

I first assumed it was redirecting them to a library in user mode somehow, but actually the syscall is replaced with "int3", which also goes to the kernel. The whole reason why the "syscall" instruction was introduced in the first place was that it's faster than the old software interrupt mechanism which has to load segment descriptors.
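As a rough sketch of what such a rewrite pass might look like (my guess at the mechanics, not the post's actual code): scan the text section for the two-byte syscall encoding 0F 05 and overwrite it with int3 (CC) plus a pad byte, so instruction lengths are preserved.

```python
SYSCALL = bytes([0x0F, 0x05])   # syscall
PATCH   = bytes([0xCC, 0x90])   # int3 + nop pad; same total length

def rewrite_syscalls(text: bytes) -> bytes:
    """Naive linear scan. A real rewriter must track instruction
    boundaries, or it will also clobber 0F 05 inside immediates
    (exactly the adversarial case from the comment above)."""
    out = bytearray(text)
    i = 0
    while i <= len(out) - 2:
        if bytes(out[i:i + 2]) == SYSCALL:
            out[i:i + 2] = PATCH
            i += 2
        else:
            i += 1
    return bytes(out)

code = bytes([0x48, 0x31, 0xFF,  # xor rdi, rdi
              0x0F, 0x05,        # syscall
              0xC3])             # ret
assert rewrite_syscalls(code) == bytes([0x48, 0x31, 0xFF, 0xCC, 0x90, 0xC3])
```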

So why not simply use KVM to intercept syscall (as well as int 80h), and then emulate its effect directly, instead of replacing the opcode with something else? Should be both faster and also less obviously detectable.

reply
Good point: an int3 is not going to be faster than a syscall, and if they implement the sandboxing policy in guest userspace, it seems it would be quite easy to disable.
reply
I think the point here is optimizing for the common case: the untrusted code is still running inside a VM, so you can still trap malicious or corner cases using a more heavy-handed method. The blog post does mention "self-healing" of JIT-generated code, for instance.

It is possible to restrict the control-flow graph to avoid the case you described; the canonical references here are the CFI and XFI papers by Ulfar Erlingsson et al. In XFI they/we did have a binary rewriter that tried to handle all the corner cases, but I wouldn't recommend going that deep; instead you should just patch the compiler (which, funnily, we couldn't do, because the MSVC source code was kept secret even inside MSFT, and the GCC source code was strictly off-limits due to being GPL-radioactive...)
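A toy illustration of the CFI idea (nothing like the real XFI machinery, and the addresses here are made up): indirect control transfers are checked against the set of legitimate function entry points, so a computed target like foo+1 is rejected even though a linear scan of the code looks clean.

```python
# Hypothetical entry points; in real CFI these come from the compiler.
FUNCTION_ENTRIES = {0x401000, 0x401040, 0x4010A0}

def checked_indirect_call(target: int) -> int:
    # Guard executed before every indirect call/jump.
    if target not in FUNCTION_ENTRIES:
        raise RuntimeError(f"CFI violation: {target:#x} is not a function entry")
    return target

assert checked_indirect_call(0x401040) == 0x401040   # legitimate target
try:
    checked_indirect_call(0x401041)                  # a foo+1-style target
except RuntimeError:
    pass
else:
    raise AssertionError("mid-instruction target was not rejected")
```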

reply
The follow-on posts describe where I plan to run the binaries. The idea is to run in a guest with no kernel and everything at ring 0, which makes sysret a dangerous thing to call; we don't have anything running at ring 3. Also, the syscall instruction clobbers some registers. All in all, between the int3 and syscall paths I counted around 20 extra instructions in my runtime (this is a guess, me trying to figure out what would happen). That is why the int3 becomes faster for what I am trying to build.

The toolchain approach suffers from the diversity of options you have to support, even if you ignore the issues you encountered. It might be easier with LLVM-based toolchains, but there are still too many things to patch, and the moment you tell people "use my build environment" it meets resistance. I am currently aiming for Python, which is easy to do. The JIT work is for when I want to do JavaScript, which I keep pushing out because once I go down that road I have to worry about threading as well. Something I want to chase, but right now I'm trying to get something working.
reply
Isn't that exactly what gvisor does?
reply
gvisor tries to be a complete kernel in userland; we are not. We will consciously choose never to support a multi-process environment in the sandbox. The idea is that there are enough people running single-process containers, and they can benefit from a lighter, more secure runtime. This solution will not try to replace the kernel. For example, the Python tests we run (HTTPS to some website) end up needing only 60 syscalls implemented, not 350. I expect to add another 10-20 to support TypeScript, but this will always be strictly single-process.

Plus, the performance overhead of gvisor is substantial, 2-10us (me reading the internet); for the system I am implementing, it is less than 1us on the hot path. And there is always the density story: my shim is currently 4KB, and the Python runtime is shared through memfd. I am working on a demo showing I can run 1000 VMs on 512 MB of RAM, each launching in under 30msec. Remember, this will never replace or handle generic multi-process sandboxes; it is targeted only at single-process environments where we can make lots of simplifying assumptions.
reply