Basically, v0 and v1 of the repo are completely different implementations, each written almost from scratch. I'm now working on the third implementation, which I believe will be the last one. :) Completely different architectural choices were made.
reply
If it's still running on more than a single core, and your students want it to go faster, aligning the work to CPUs will almost certainly be useful.

I saw you mentioned Windows development elsewhere. You might be interested to know that Microsoft pioneered Receive Side Scaling and Send Side Scaling. If you try your proxy out on Windows, be sure to hook into those systems there.

The less work your proxy does, the more important avoiding cross-core communication is.

reply
Pin threads to cores, and make sure threads on different cores aren’t writing to the same 64 or 128 byte block. Look up “false sharing”.
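
For anyone who wants something concrete, here is a rough sketch of both ideas in C on Linux. It is only an illustration, not anything from the proxy in the article: `padded_counter`, `worker_arg`, `nth_allowed_cpu`, `pin_current_thread_to_cpu` and `worker` are names I made up, and the 64-byte `CACHE_LINE` is an assumption (use 128 if your CPU effectively shares lines in pairs). It leans on `sched_setaffinity`, where a pid of 0 means the calling thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdalign.h>
#include <stdint.h>

/* Assumed cache-line size: 64 bytes is common, some CPUs want 128. */
#define CACHE_LINE 64

/* One counter per worker, aligned so each one owns a whole cache line.
 * Two workers bumping neighbouring counters then never write to the
 * same line, which is what avoids false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;
};
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each counter should fill exactly one cache line");

/* Return the n-th CPU the caller is allowed to run on (wrapping
 * around), or -1 on error. Hypothetical helper for this sketch. */
static int nth_allowed_cpu(int n)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    int count = CPU_COUNT(&allowed);
    if (count == 0)
        return -1;
    n %= count;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed) && n-- == 0)
            return cpu;
    }
    return -1;
}

/* Pin the calling thread to one CPU. Returns 0 on success, -1 with
 * errno set on failure. */
static int pin_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Per-thread arguments: which CPU to pin to and which private counter
 * to update. */
struct worker_arg {
    int cpu;
    struct padded_counter *counter;
    uint64_t iterations;
    int pin_result;
};

/* Thread start routine: pin first, then touch only this thread's own
 * cache line. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;
    arg->pin_result = pin_current_thread_to_cpu(arg->cpu);
    for (uint64_t i = 0; i < arg->iterations; i++)
        arg->counter->value++;
    return NULL;
}
```

You would hand `worker` to `pthread_create` once per core, giving each thread its own `worker_arg` with `cpu` set from `nth_allowed_cpu(i)` and `counter` pointing at element `i` of a `struct padded_counter` array. Drop the `alignas` and the counters pack eight to a line, which is the layout the advice above warns about.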
reply
Thanks for the write-up. So the first version was synchronous, the second version used epoll, and the third one will use io_uring?
reply
I would be interested to see benchmarks for that patch.
reply
I don't have the right setup to produce good benchmarks for this right now, but when I had the chance to put it into practice, the improvement between no CPU alignment and full alignment was quite large. That was on 28-core machines (with 16 NIC queues), many years ago, but IIRC I got at least 10x the connections/sec out of the boxes after tuning, and after tuning 12 cores were idle ... the machines were repurposed; if they had been ordered for this, they should have had one core per NIC queue in a single socket. The difference is likely smaller on a 4-core machine as described in the article.

The hardest part is going to be generating enough load. I had production load, which has the benefit that you don't need to generate it. OTOH, it was a transitional need, and I couldn't reasonably test above 50% of peak traffic on a single machine ... I hit that mark around the time traffic started dropping, and then it wasn't fun anymore.

reply