Ask HN: Why does my Node.js multiplayer game lag at 500 players with low CPU?

13 points

Please also note that Hetzner is not providing CPU steal information inside of VMs. So there could be 75% steal and you wouldn't notice! It's unlikely for CCX instances, but it happened for me a lot with regular instances.
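For reference, on a host that does expose steal, it is the eighth counter on the aggregate `cpu` line of `/proc/stat` (after user, nice, system, idle, iowait, irq, softirq). A minimal sketch of turning two samples into a steal fraction follows; the helper names are made up for illustration, and if the hypervisor hides steal the result will simply read 0, so a zero here proves nothing.

```typescript
// Counters on the aggregate "cpu" line of /proc/stat, in order (clock ticks).
const CPU_FIELDS = [
  "user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal",
] as const;

type CpuField = (typeof CPU_FIELDS)[number];
type CpuSample = Record<CpuField, number>;

// Hypothetical helper: parse the aggregate "cpu ..." line out of /proc/stat text.
function parseCpuLine(procStat: string): CpuSample {
  const line = procStat.split("\n").find((l) => l.startsWith("cpu "));
  if (!line) throw new Error("no aggregate cpu line found");
  const values = line.trim().split(/\s+/).slice(1).map(Number);
  const sample = {} as CpuSample;
  CPU_FIELDS.forEach((field, i) => {
    sample[field] = values[i] || 0; // missing or unparsable counters count as 0
  });
  return sample;
}

// Fraction of elapsed CPU time reported as stolen between two samples.
function stealFraction(before: CpuSample, after: CpuSample): number {
  let total = 0;
  for (const field of CPU_FIELDS) total += after[field] - before[field];
  return total === 0 ? 0 : (after.steal - before.steal) / total;
}
```

In practice you would read `/proc/stat` twice a few seconds apart and pass both parsed samples to `stealFraction`.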

by toast0 17 hours ago

What are your processes waiting on? In Linux top, show the WCHAN field. In FreeBSD top, look at the STATE field. Ideally, your service processes are waiting on I/O (epoll, select, kqread, etc.) or you're CPU limited.

Is there any cross-room communication? Can you spawn a process per room? Scaling limited at 25% CPU on a 4 vcpu node strongly suggests a locked section limiting you to effectively single threaded performance. Multiple processes serving rooms should bypass that if you can't find it otherwise, but maybe there's something wrong in your load balancing etc.
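The arithmetic behind that hunch, as a rough sketch: 25% of a 4 vCPU machine is exactly one core's worth of work, which is what a single saturated thread looks like from the outside. The function names and the tolerance are illustrative assumptions, not a real diagnostic.

```typescript
// Convert a whole-machine CPU percentage into "cores' worth" of busy time.
function busyCores(totalCpuPercent: number, vcpus: number): number {
  return (totalCpuPercent / 100) * vcpus;
}

// Heuristic only: usage that plateaus near one core hints at a single
// saturated thread (or a lock serializing the work). The tolerance is arbitrary.
function looksSingleThreadBound(
  totalCpuPercent: number,
  vcpus: number,
  tolerance = 0.1
): boolean {
  return Math.abs(busyCores(totalCpuPercent, vcpus) - 1) <= tolerance;
}
```

A plateau at some other level (say 60% of 4 vCPUs, or 2.4 cores) would point away from the single-thread explanation.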

Personally, I'd rather run with fewer layers, because then you don't have to debug the layers when you have perf issues. Do matchmaking wherever with whatever layers, and let your room servers run in the host os, no containers. But nobody likes my ideas. :P

Edit to add: your network load is tiny. This is almost certainly something with your software, or how you've set up your layers. Unless those vCPUs are ancient, you should be able to push a whole lot more packets.

by jbryu 15 hours ago

So when running `top` WCHAN shows `ep_poll` most of the time and sometimes `-`. Even when the game starts lagging this pattern stays pretty consistent.

There is no cross-room communication. I could spawn a process per room but I was trying to address this issue with my current Docker setup where I have multiple `game` containers that run a single node.js process and each process can host multiple rooms.

Not having to use Docker sounds simpler, but that's where I'm at atm haha.

I agree that the network load feels very small. Maybe it's a socket.io related issue where when many broadcasts are being fired at once, then a shared I/O step gets bottlenecked?

Here's my actual typing broadcast code. I was originally broadcasting from the socket event callback itself, but I found performance improved slightly by batching broadcasts per player in a setInterval loop (also note that only 1 player in a given room can be typing at once, so batching broadcasts per room shouldn't address the bottleneck).

  /**
   * Used to handle very frequent typing events more gracefully to avoid overloading CPU
   */
  const TypingUsersMap = new Map<
    ConnectionId,
    {
      socketId: string | null; // doesn't exist for bots
      roomId: PublicRoomId;
      userId: UserId;
      currentInput: string;
    }
  >();

  type ConnectionId = `${UserId}:${PublicRoomId}`;

  // ! this should be same as client throttle interval
  const TYPING_BROADCAST_INTERVAL = 200;

  export let typingBroadcastInterval: NodeJS.Timeout | undefined = undefined;
  export const startTypingBroadcastJob = () => {
    typingBroadcastInterval = setInterval(() => {
      const freshTypingUsersMap = new Map(TypingUsersMap);
      TypingUsersMap.clear();

      if (freshTypingUsersMap.size === 0) return; // Nothing to do

      // Go through each user that has a pending update
      for (const [_connectionId, data] of freshTypingUsersMap.entries()) {
        const socket = data.socketId
          ? io.sockets.sockets.get(data.socketId)
          : undefined;

        // Use the data we stored to perform the broadcast
        if (socket) {
          // emit to other players
          socket
            .to(data.roomId)
            .volatile.emit(
              SOCKET_EVENT_NAMES.USER_TYPING_RES,
              data.userId,
              data.currentInput
            );
        } else {
          // bots emit to everyone
          io.to(data.roomId).volatile.emit(
            SOCKET_EVENT_NAMES.USER_TYPING_RES,
            data.userId,
            data.currentInput
          );
        }
      }
    }, TYPING_BROADCAST_INTERVAL);
  };

  export const stopTypingBroadcastJob = () => {
    if (typingBroadcastInterval) {
      clearInterval(typingBroadcastInterval);
      typingBroadcastInterval = undefined;
    }
  };

  // this is called from the USER_TYPING socket event callback. so effectively every throttled keystroke by the user gets queued.
  export const queueTypingEvent = ({
    socketId,
    roomId,
    userId,
    currentInput,
  }: {
    socketId: string | null;
    roomId: PublicRoomId;
    userId: UserId;
    currentInput: string;
  }) => {
    const connectionId: ConnectionId = `${userId}:${roomId}`;
    TypingUsersMap.set(connectionId, {
      socketId,
      roomId,
      userId,
      currentInput,
    });
  };

by octo888 15 hours ago

3000 pps / 6 Mbps is pretty much nothing for that server. I wouldn't change random network sysctl options.

> This suggests to me that the bottleneck isn’t CPU or app logic, but something deeper in the stack

Just a word of caution - I have seen plenty of people speed towards eg "it must be a bug in the kernel" when 98% of the time it is the app or some config.

by jbryu 15 hours ago

Yeah changing the sysctl options was a shot in the dark... I really hope it's my app code. But the fact that the same bottleneck occurs even when I add more containers which decreases the load per container confuses me. I mentioned this in another comment but I wonder if socket.io broadcast calls share the same I/O resource or something. Maybe a lock?

by ycombinatrix 10 hours ago

Are you buffering your output? Doing one syscall (write) for each client in a server for each keystroke is a significant amount of IO overhead and context switching.

Try buffering the outgoing keystrokes to each client. Then, someone typing "hello world" in a server of 50 people will use 50 syscalls instead of 550 syscalls.

Think Nagle's algorithm.
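A toy model of the difference, assuming one `write()` call stands in for one syscall. `CoalescingSender` and `countWrites` are hypothetical names invented for this sketch, not socket.io APIs, and the buffered count of 50 assumes the whole phrase is typed within a single flush window.

```typescript
// Hypothetical transport: one call to write() stands in for one syscall.
type Write = (clientId: number, payload: string) => void;

class CoalescingSender {
  private pending = new Map<number, string>();
  private write: Write;

  constructor(write: Write) {
    this.write = write;
  }

  // Queue data for a client instead of writing immediately.
  queue(clientId: number, chunk: string): void {
    this.pending.set(clientId, (this.pending.get(clientId) ?? "") + chunk);
  }

  // Flush: exactly one write per client that has pending data.
  flush(): void {
    this.pending.forEach((payload, clientId) => this.write(clientId, payload));
    this.pending.clear();
  }
}

// Count writes for "hello world" (11 keystrokes) relayed to 50 clients.
function countWrites(buffered: boolean): number {
  let writes = 0;
  const sender = new CoalescingSender(() => {
    writes++;
  });
  for (const key of "hello world") {
    for (let client = 0; client < 50; client++) {
      sender.queue(client, key);
      if (!buffered) sender.flush(); // unbuffered: write on every keystroke
    }
  }
  sender.flush(); // buffered: a single flush at the end of the window
  return writes;
}
```

With a real timer the flush would run on an interval, so the saving depends on how many keystrokes land inside each window.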

by jbryu 8 hours ago

I'm somewhat buffering right now. Every time the current turn player types I buffer their input on the backend, and I have a job set up that broadcasts typing events every ~200ms using this buffer.

I could increase this interval, but I'd like to keep it as short as I can afford, to keep that realtime feel (i.e. other players can see what the current turn player is typing).

by pvg 19 hours ago

It sounds like you want to coalesce the outbound updates otherwise everyone typing is accidentally quadratic.
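To make "accidentally quadratic" concrete, here is a back-of-the-envelope sketch of the fan-out. The function names and the example numbers are assumptions chosen for illustration, not measurements from the game.

```typescript
// Messages sent per broadcast tick in one room when each typing player's
// update is relayed to every other player in that room.
function messagesPerTick(playersInRoom: number, typingPlayers: number): number {
  return typingPlayers * (playersInRoom - 1);
}

// Outbound messages per second across all rooms for a given broadcast interval.
function messagesPerSecond(
  rooms: number,
  playersInRoom: number,
  typingPlayers: number,
  intervalMs: number
): number {
  return rooms * messagesPerTick(playersInRoom, typingPlayers) * (1000 / intervalMs);
}
```

With 8 players in a room, everyone typing gives 8 * 7 = 56 messages per tick (it grows with the square of the room size), while a single typer gives 7. Under the same assumptions, 60 such rooms with one typer each at a 200 ms interval come to 2100 outbound messages per second.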

by jbryu 19 hours ago

I thought this might've been the issue too, but because the game is turn-based there should only ever be 1 person typing at once (in a given room).

by pvg 14 hours ago

60 * 7 is not all that great either if you get cascading and clumping as people type at the same time. Coalescing the outbound updates still seems like a good idea, and since the game is turn-based you know it's not really going to affect gameplay. You've basically made yourself a first-person-shooter networking problem for a game that's slower than WoW. That feels like overkill in terms of self-imposed obstacles.

by jbryu 13 hours ago

Ahhhh I see what you mean now. You just gave me some good ideas. Alas, because of the nature of my game, it will always have first-person-shooter-esque networking problems despite being turn-based. But it's good to know that I'm dealing with a non-trivial level of throughput.

by brudgers 17 hours ago

> there should only ever be 1 person typing at once (in a given room)

Have you verified that is the case?

by jbryu 16 hours ago

Yep just triple checked. If distributing the load on a single server by adding more backend containers doesn't decrease ping then maybe this is just the natural upper bound for my particular game... The only shared bottleneck between all backend containers I can think of right now is at the OS or network interface layer, but things still lag even when I tried increasing OS networking limits:

  net.core.wmem_max = 16777216
  net.core.rmem_max = 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216

Perhaps the reality for low-latency multiplayer games is to embrace horizontal scaling rather than vertical scaling? Not sure.

by codingdave 15 hours ago

Networking bottlenecks are not always on your box - they could be on the router your box is talking to. Or, depending on load, the ethernet packets themselves could be crowding the physical subnet. Do you have a way to mock 500 users playing the game that would truly keep all the traffic internal to your OS? Because if that works, but the lag persists for real players, the problem is external to your OS.

by jbryu 15 hours ago

Good point. I actually don't know what performance looks like with 500 real users. The way I'm mocking right now is by running a script on my local machine that generates 500+ bots that listen to events to auto-join and play games. I tried to implement the bots to behave as closely to humans as possible. I'm not sure if this is what you mean by keeping traffic internal to my box's OS, but right now this approach creates lag. I didn't consider whether spinning up hundreds of websocket connections from a single source (my local machine) would have any implications when load testing, hm.
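One way to compare a loopback run (bots on the server itself) against a remote run is to collect ping samples from each and look at tail percentiles rather than averages. A minimal sketch; the helper names and the factor-of-2 threshold are arbitrary assumptions for illustration.

```typescript
// Nearest-rank percentile of latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Hypothetical rule of thumb: if the remote tail latency is much worse than
// the loopback tail latency, the extra delay is probably outside the box.
function lagLooksExternal(
  loopbackMs: number[],
  remoteMs: number[],
  p = 99,
  factor = 2
): boolean {
  return percentile(remoteMs, p) > factor * percentile(loopbackMs, p);
}
```

If both runs lag equally, the network path (and the single-source load generator) is less likely to be the culprit than the server process itself.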

by brudgers 14 hours ago

Networking often scales better horizontally.

Computation can sometimes scale well vertically, but proprietary OSes are more likely to be tuned for it... as a premium feature.

by cbenskxk 3 hours ago

Are you using uWebSockets.js?

by moomoo118 hours ago

Are you awaiting anywhere, such that you might be better off doing fire-and-forget instead?

by bigyabai 19 hours ago

25% CPU usage could indicate that your I/O throughput is bottlenecked.