undefined

points

[-]

Yes there are scheduling issues, Numa problems , etc caused by the cluster in a box form factor.

We had a massive performance issue a few years ago that we fixed by mapping our processes to the numa zones topology . The default design of our software would otherwise effectively route all memory accesses to the same numa zone and performance went down the drain.

by brcmthrowaway3 hours ago|

parent|

[-]

Intel contributes to Linux, how is this a problem?

by fc417fc8022 hours ago|

parent|

[-]

Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).

Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.

by wmf3 hours ago|

parent|

prev|

[-]

Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.

by lich_king4 hours ago|

prev|

[-]

I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.

The bottlenecks are pretty much hardware-related - thermal, power, memory and other I/O. Because of this, you presumably never get true "288 core" performance out of this - as in, it's not going to mine Bitcoin 288 as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.

by Retr0id4 hours ago|

parent|

[-]

Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.

by rishabhaiover3 hours ago|

prev|

[-]

That's a great point. Linux has introduced io_uring, and I believe that gives us the native primitives to hide latency better?

But that's just one piece of the puzzle, I guess.

by whateverboat5 hours ago|

prev|

[-]

I think linux can handle upto 1024 cores just fine.

by zokier4 hours ago|

parent|

[-]

afaik the mainline limit is 4096 threads. HP sells server with 32 sockets x 60 cores/socket x 2 threads/core = 3840 threads, so we are pretty close to that limit.

by Retr0id3 hours ago|

parent|

[-]

I had no idea we had socket counts so high, do you know where I could find a picture of one?

by zokier1 hours ago|

parent|

[-]

It's bit cheating because it's cluster based system: https://www.hpe.com/psnow/doc/a50004268enw

So 4 sockets per chassis, up to 8 chassis in a complete system. Afaik OS sees it as single huge system, that is kinda their special sauce here.

by to11mtm1 hours ago|

parent|

prev|

[-]

Sounds like a HPE Compute Scale-up Server 3200, but again keep in mind that's something where there's probably a fabric between nodes one way or another.

by moffkalast3 hours ago|

parent|

prev|

[-]

https://xkcd.com/619/

by jeffbee4 hours ago|

prev|

[-]

There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.

by fc417fc8021 hours ago|

parent|

[-]

Do the partitioned stacks of network namespaces share a single underlying global stack or are they fully independent instances? (And if not, could they be made so?)

by wmf1 hours ago|

parent|

[-]

Usually network namespaces are linked together with a single bridge so you can get lock contention there.

If you have a separate physical NIC for each namespace you probably won't have any contention.

by jeffbee1 hours ago|

parent|

[-]

I think you could get much of the way there by isolating a single NIC's receive queues, so the kernel doesn't decide to run off and service softirqs for random foreign tasks just because your task called tcp_sendmsg.

by rishabhaiover3 hours ago|

parent|

prev|

[-]

io_uring?

by jeffbee3 hours ago|

parent|

[-]

If anything, uring makes the problem much worse by reducing the cost of one process flooding kernel internals in a single syscall.

by to11mtm1 hours ago|

prev|

[-]

> OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.

I mean....

IMO Erlang/Elixir is a not-terrible benchmark for how things should work in that state... Hell while not a runtime I'd argue Akka/Pekko on JVM Akka.Net on the .NET side would be able to do some good with it...[0] Similar for Go and channels (at least hypothetically...)

[0] - Of course, you can write good scaling code on JVM or CLR without these, but they at least give some decent guardrails for getting a good bit of the Erlang 'progress guaranteed' sauce.

by user59944614 hours ago|

prev|

[-]

> I wonder whether the next bottleneck becomes software scheduling rather than silicon

Yep, the scheduling has been a problem for a while. There was an amazing article few years ago about how the Linux kernel was accidentally hardcoded to 8 cores, you can probably google and find it.

IMO the most interesting problem right now is the cache, you get a cache miss every time a task is moving core. Problem, with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent trashing and reloading the CPU cache.

by 01HNNWZ0MV43FF3 hours ago|

parent|

[-]

I searched for "Linux kernel limited to 8 cores" and found this

https://news.ycombinator.com/item?id=38260935

> This article is clickbait and in no way has the kernel been hardcoded to a maximum of 8 cores.

by user59944613 hours ago|

parent|

[-]

That's the one. Funny thing, it's not actually clickbait.

The bug made it to the kernel mailing list where some Intel people looked into it and confirmed there is a bug. There is a problem where is the kernel allocation logic was capped to 8 cores, which leaves a few percent of performance off the table as the number of cores increase and the allocation is less and less optimal.

It's classic tragedy of the commons. CPU have got so complicated, there may only be a handful of people in the world who could work and comprehend a bug like this.