by rom1v 3 hours ago

by Animats 1 hour ago

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.

This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.

QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
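For readers who have not written it by hand, the fork-then-exec sequence discussed above can be sketched in a few lines. This is only an illustration of the two-step pattern on a modern Unix, not a reconstruction of the PDP-11 implementation; the helper name `run_via_fork_exec` and the exit code 127 for a failed exec are assumptions made for the example.

```python
import os
import sys


def run_via_fork_exec(argv):
    """Run argv in a forked child that immediately calls exec; return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: starts as a copy of the parent, then replaces its image.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed; exit without running any of the parent's cleanup code.
            os._exit(127)
    # Parent: wait for the child and decode the wait status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = run_via_fork_exec([sys.executable, "-c", "raise SystemExit(7)"])
    print("child exit code:", code)
```

Everything between `os.fork()` and `os.execvp()` runs in the child with the parent's full state, which is the window the rest of this thread argues about.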

by dcrazy 59 minutes ago

Don’t pretty much all OSes implement process startup in userspace? On macOS, the kernel creates a process with an image of dyld and points it at dyld_start, which actually takes care of parsing the Mach-O header. I assumed ld.so does the same job on Linux.

by lukan 1 hour ago

It is almost as if you agree with the authors ..

"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"

(But thanks for the good explanation)

by anarazel 3 hours ago

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.

by mort96 2 hours ago

The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

by marcosdumay 1 hour ago

CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.

At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.

by mort96 39 minutes ago

CoW is probably a good idea regardless, yeah. Overcommit is more questionable. Regardless, both ought to be argued based on their own merits. It's unfortunate that both are necessary as a consequence of fork().

by Someone 55 minutes ago

> The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1

It may not be slow, but in the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and does typically unnecessary checks of whether the refcounts became zero). A combined fork/exec syscall can avoid that work.

On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.

That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.

I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.
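As a rough illustration of what a combined call with "a lot of parameters" looks like in practice, here is a sketch using `posix_spawn` through Python's `os` module. The helper name `spawn_with_stdout_pipe` is hypothetical, and whether the underlying implementation avoids a fork internally depends on the platform and C library; the sketch only shows the API shape, where inherited descriptors are described up front as file actions.

```python
import os
import sys


def spawn_with_stdout_pipe(argv):
    """Spawn argv with stdout redirected to a pipe; return (exit_code, output_bytes)."""
    read_fd, write_fd = os.pipe()
    # File actions are applied in the new process before the program starts.
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),  # child's stdout becomes the pipe's write end
        (os.POSIX_SPAWN_CLOSE, read_fd),     # child has no use for the read end
    ]
    # posix_spawn does not search PATH, so argv[0] must be a path to the executable.
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)  # the parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output


if __name__ == "__main__":
    print(spawn_with_stdout_pipe([sys.executable, "-c", "print('hi')"]))
```

Each extra knob (working directory, signal mask, process group) becomes another argument or file action, which is the parameter-checking cost described above.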

by foresto 54 minutes ago

> The problem with fork isn't really that it's slow.

Did someone suggest that it was?

by mort96 44 minutes ago

anarazel's comment focuses entirely on performance, indicating that they have an impression that the discussion about why fork is bad is about performance. I'm not entirely sure where this impression came from, as it's not mentioned in rom1v's quote nor a point in the linked paper, "A fork() in the road".

by dapperdrake 1 hour ago

How else does consistency work, then?

Only being half facetious here. Maybe you or someone else really has a better take.

by mort96 1 hour ago

What do you mean by consistency here?

by theK 2 hours ago

Didn't he just say that fork turns out to be comparatively faster than the non-fork samples we have? I.e., Linux spawns processes faster than Microsoft's kernels?

by mort96 1 hour ago

Didn't I just say that "the problem with fork isn't really that it's slow"? It's all the other OS design choices it forces on you if you want it to be fast.

by theK 1 hour ago

Right, you did. I somehow misread your comment.

by nvme0n1p1 2 hours ago

We don't have any broadly used non-fork samples. Windows, macOS, and Linux all have fork. So the presence of fork can't be the reason for the performance difference.

(Windows's fork is called ZwCreateProcess)

by Someone 48 minutes ago

MacOS has posix_spawn. See https://developer.apple.com/library/archive/documentation/Sy... (yes, that’s an iOS man page. MacOS has the call, too, but I couldn’t find the man page online and it looks identical to me)

I don’t know how they implemented it, though. Under the hood, it could do the equivalent of a fork/exec pair.

by plorkyeran 18 minutes ago

XNU's posix_spawn implementation is not fork/exec-based. It does roughly what the API suggests it would do.

by dcrazy 1 hour ago

NtCreateProcess does not implement a forking model. It is analogous to posix_spawn.

by pjmlp 3 hours ago

Because that OS's best practice is to use threads.

Traditionally Windows applications that create processes all the time come from UNIX heritage.

Contrary to UNIX, Windows NT was designed with a threads-first mentality from the get-go.

On UNIX, threads were added after the fact, and to this day there are gotchas when mixing POSIX threads with signals, fork and exec.

by PaulDavisThe1st 1 hour ago

A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that no longer share memory. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher-level abstractions like pthreads.

You're right that POSIX semantics get tangled when using threads.
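The shared versus non-shared default described above is easy to observe. The following is a minimal sketch with hypothetical helper names: a write made in a forked child lands in the child's private copy of memory, while a write made in a thread is visible to the rest of the process.

```python
import os
import threading


def mutate_in_forked_child(state):
    """Fork a child that modifies `state`; the parent's copy is unaffected."""
    pid = os.fork()
    if pid == 0:
        state["value"] = "child"  # changes only the child's copy of the dict
        os._exit(0)
    os.waitpid(pid, 0)


def mutate_in_thread(state):
    """Run a thread that modifies `state`; the change is visible to the caller."""
    worker = threading.Thread(target=lambda: state.update(value="thread"))
    worker.start()
    worker.join()


if __name__ == "__main__":
    state = {"value": "initial"}
    mutate_in_forked_child(state)
    print("after fork:  ", state["value"])
    mutate_in_thread(state)
    print("after thread:", state["value"])
```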

by pjmlp 1 hour ago

Well, Windows NT isn't the same design as 16-bit Windows; for all practical purposes it only shares the name, and it has more influence from OS/2 than from 16-bit Windows.

Which is why I took the effort to explicitly refer to Windows NT in my comment, already expecting some traditional answers from UNIX folks.

Also, for historical reasons, POSIX threads are the outcome of every UNIX going its own way implementing threads and finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.

by snozolli 1 hour ago

> whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

How are those not simply child processes? I don't understand your use of the word 'threads' here.

Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.

by trumpdong 56 minutes ago

They are child processes.

Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
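One way to see the two kinds of ID from user space is to compare what a second thread reports. This is a small sketch with a hypothetical helper name; on Linux, `os.getpid()` reports the thread group ID, while `threading.get_native_id()` reports the kernel's per-thread ID.

```python
import os
import threading


def ids_in_worker_thread():
    """Return (pid, native_thread_id) as observed from inside a new thread."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()                  # same for every thread in the group
        seen["tid"] = threading.get_native_id()    # distinct per kernel task

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen["pid"], seen["tid"]


if __name__ == "__main__":
    print("main  :", os.getpid(), threading.get_native_id())
    print("worker:", *ids_in_worker_thread())
```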

by pjmlp 1 hour ago

Actually on Windows a process is a thread with additional information.

The unit of execution is the thread.

On the UNIX world it depends on which UNIX you are talking about.

Linux has a similar model to Windows NT nowadays, hence clone() as key primitive.

Other UNIXes have different approaches.

by sunshowers 1 hour ago

The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

by pjmlp 58 minutes ago

True, but on Windows the approach is then to use COM servers, which have a faster IPC model, and can even serve multiple clients, depending on how the apartment space is configured.

by mort96 43 minutes ago

"Faster IPC model" than what? Faster than writing to and reading from a pipe? Faster than POSIX shared memory?

by zozbot234 2 hours ago

Windows was designed with a threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

by JdeBP 2 hours ago

Windows NT was never designed with pre-386 machines in mind. That was the territory of the old DOS+Windows. Windows NT from the get-go was for machines with page-based virtual memory.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...

by pstuart 1 hour ago

WinNT 3.5 was a solid offering.

by epcoa 2 hours ago

This is not true. NT never had fork, was always based on the assumption of an MMU and Dave Cutler was a well known fork hater in the 80s long before this paper came out and made it cool to be so. By the time Windows 95 was out, the baseline was 386 with an MMU. CreateThread was initially designed for NT in 1993 though (which didn’t support pre-386 CPUs).

by keitmo 1 hour ago

NT performed unnatural acts to implement fork semantics for the POSIX subsystem.

by JdeBP 1 hour ago

As mentioned elsewhere on this page, Windows NT had fork from the start. Vide NtCreateProcess and what happens if an image file is not explicitly supplied.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...

by dcrazy 1 hour ago

NtCreateProcess doesn’t accept an image file parameter.

by JdeBP 27 minutes ago

You haven't read the doco. I did point to some. The image file is supplied (or not) via the section object.

Think it through. Windows NT supported fork from the start in its POSIX subsystem, that subsystem was layered on top of the Native API, and this is the Native API mechanism that the POSIX subsystem employed. Although it took until Gary Nebbett for someone to publicly show how, even though people knew informally back in 1993.

by dcrazy 47 minutes ago

NT was designed to be platform-agnostic, and its original target was the DEC Alpha. Its process model owes nothing to pre-386 CPUs. The WinAPI CreateProcess function is a layer atop NtCreateProcess, so that is where the pre-386 heritage lives. But even the WinAPI process model changed significantly with 32-bit Windows.

by pjmlp 1 hour ago

Windows NT!

Misread on purpose to make a point?

by knome 1 hour ago

the only difference between a thread and a process on linux is how many structures they share. the function is identical.

by aseipp 2 hours ago

I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++

by nvme0n1p1 2 hours ago

That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

by dcrazy 1 hour ago

Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.

Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.

by aseipp 3 hours ago

This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf

by omoikane 3 hours ago

Discussion at the time:

https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)

by jwilk 1 hour ago

Discussed also in 2021: https://news.ycombinator.com/item?id=29709802 (16 comments)

by pizlonator 3 hours ago

Fork is marvelous for the zygote pattern

Hard to come up with an optimization that is equally efficient and elegant

by toast0 2 hours ago

The zygote pattern[1] is a great optimization to deal with the cost of forking, but IMHO, being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.

[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.

by pizlonator 2 hours ago

You're referring to something else, and maybe I'm using the term "zygote" incorrectly.

In all uses of zygotes that I have seen, here's what's really happening:

- `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.

- To make this even faster, you have a pool of pre-forked processes sit around.

- Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.

So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).
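The pre-initialized flavor of the pattern described above can be sketched as follows. The names `expensive_init` and `fork_worker` are hypothetical, and the dictionary merely stands in for costly start-up work. One caveat for this particular language: CPython updates reference counts when objects are read, so a child touches (and therefore copies) more pages than an equivalent C program would.

```python
import os


def expensive_init():
    """Stand-in for costly start-up work that should happen only once."""
    return {n: n * n for n in range(100_000)}


def fork_worker(table, key):
    """Fork a worker that reuses the already-built table; return its answer."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the table is already in memory, so no initialization is repeated.
        try:
            os.close(read_fd)
            os.write(write_fd, str(table[key]).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        answer = int(reader.read())
    os.waitpid(pid, 0)
    return answer


if __name__ == "__main__":
    table = expensive_init()                         # paid once, in the zygote
    print([fork_worker(table, k) for k in (2, 3, 12)])
```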

by toast0 2 hours ago

Oh I see. I guess your zygotes have developed more than mine. I think Google may have coined or at least popularized the term zygote for this in Chrome and Android, Chrome documentation [1] says:

> A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.

I think reading the first sentence and stopping covers my zygote, but adding the second sentence covers yours. So I think we're both right!

I think both paths are useful. If your children need time to startup and become ready, spawn one that does start up work, and then it (pre)forks at the ready state to have processes ready to handle requests (your zygote). This does require a traditional fork() to avoid duplication of work.

But if forking is expensive at runtime because you have a million FDs open and a whole lot of memory allocations, spawn spawners before you start doing work (my zygote). This could be unnecessary with an inexpensive way to spawn a new process from a process that has lots of resources in use.

Of course, you can also use my zygotes to spawn your zygotes. Zygoteception.

[1] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...
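The "spawn spawners before you start doing work" variant can be sketched like this. The `Spawner` class and its line-delimited JSON protocol over pipes are assumptions made for illustration, not how Chrome or Android implement it. The helper is forked while the application is still small, so later spawns never duplicate the large process.

```python
import json
import os
import subprocess
import sys


class Spawner:
    """Small helper process forked early; it launches commands on request."""

    def __init__(self):
        req_r, req_w = os.pipe()
        rep_r, rep_w = os.pipe()
        self.pid = os.fork()
        if self.pid == 0:
            os.close(req_w)
            os.close(rep_r)
            self._serve(req_r, rep_w)  # never returns
        os.close(req_r)
        os.close(rep_w)
        self._requests = os.fdopen(req_w, "w")
        self._replies = os.fdopen(rep_r, "r")

    @staticmethod
    def _serve(req_fd, rep_fd):
        try:
            with os.fdopen(req_fd, "r") as requests, os.fdopen(rep_fd, "w") as replies:
                for line in requests:  # one JSON-encoded argv list per line
                    code = subprocess.run(json.loads(line)).returncode
                    replies.write(f"{code}\n")
                    replies.flush()
        finally:
            os._exit(0)

    def run(self, argv):
        """Ask the helper to run argv and return the command's exit code."""
        self._requests.write(json.dumps(argv) + "\n")
        self._requests.flush()
        return int(self._replies.readline())

    def close(self):
        self._requests.close()  # EOF ends the helper's request loop
        self._replies.close()
        os.waitpid(self.pid, 0)


if __name__ == "__main__":
    spawner = Spawner()                  # fork while the process is still small
    ballast = bytearray(50_000_000)      # the application then grows
    print(spawner.run([sys.executable, "-c", "raise SystemExit(3)"]))
    spawner.close()
```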

by skydhash 1 hour ago

I quite like the idea. I’m using OpenBSD on an oldish laptop, and fork-exec is expensive enough that it conflicts with the USB subsystem. Isochronous transfers have a 1 ms real-time requirement, and it seems that the fork-exec system calls hold the giant lock long enough to interfere with it (audio stutters).

I’ve not bothered to profile it, but it seems that processes that have a lot of mapped pages are the issue (firefox, emacs, …). In the emacs case, the issue arises when the main process tries to fork-exec; if I start a shell session (with shell-mode or term-mode), it works fine.

by PaulDavisThe1st 1 hour ago

> being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

It's called clone(2)

by trumpdong 53 minutes ago

Which argument to clone starts the process with an empty address space?

by p_l 42 minutes ago

And so easy to make into a bottleneck.

Yes, the zygote pattern makes it easy to turn fork() into a bottleneck: doing it well requires a lot more discipline and low-level tricks (linker scripts, compiler-specific extensions, custom sections, low-level dependencies on page size that get "fun" on ARM servers).

If you don't, you might wake up with fork() causing latency issues.

by vlovich123 2 hours ago

The paper explicitly covers this: various memory CoW/snapshot mechanisms are probably faster and safer than the zygote pattern. As it stands, getting the zygote pattern correct and safe is something you have to plan for up front. You can’t retrofit it, which is why the paper mentions it has poor composability. Also, the advantages of the zygote pattern can be overstated: the memory-sharing benefit is minimal because the fork has to happen so early, and modern OSes already transparently CoW duplicate pages in the background.

by loeg 52 minutes ago

In what sense can you not retrofit the zygote pattern?
