upvote
The display controller and render device are completely distinct logical devices, even though they are often grouped in a "GPU". On mobile architectures they are quite far separated, leading to annoying problems surrounding what we on Linux call "split drm devices".

Updating plane properties such as to move the cursor plane around or disable it would by itself not block on render activities, as they are completely distinct blocks.

The render hardware could be powered down, but I doubt powering it up and compositing the cursor would take long enough to complete to cause any noticable lag.

Under the Linux APIs, updates to the display controller are done through KMS atomic commits, and one mistake you could do display-server side would be to provide a fence in this atomic commit that the scheduler will use to wait on long-running GPU work before using the provided graphics buffers. Under this API, none of the changes - including mouse movements - would then be applied until that fence is signalled. Changing plane associations can lead to resource reallocations that can be a bit heavy.

Not sure if the kernel driver in macOS works anything remotely similar to this, and the driver could also just be dumb and block on unrelated things ("let's just wait another vblank to see this apply....", "as we only need one plane now let's power down hardware and wait for that to settle..."). It could also just be windowserver that waits for work to finish on its own, not providing any cursor updates in the meantime.

The reality is that it will take reverse engineering or looking at actual code to know what's going on.

reply
Since this is but an iPhone crammed into a laptop, could this behaviour stem from the fact that iPhones generally need not render a cursor?
reply
No, the cursor just uses an overlay plane, and mobile architectures usually have far more planes (sometimes even an arbitrarily configurable amount), and more flexible hardware compositioning overall than desktop GPUs for efficiency reasons.

EDIT: Also note that there is nothing new with the Neo here, as all Macs since the M1 have used the same chip architecture as the iPhone.

Desktop GPU designs did not focus on tiny efficiency gains, and often only has a primary plane, a single overlay plane (for e.g., a video), and a dedicated cursor plane. Some even have to share a single overlay plane between all connected displays. It's a recent thing for desktop GPUs to get more flexible in this area, in part to improve laptop battery life in the cases where the laptop is almost entirely idle.

(For those unaware, a "plane" here is the entity in the display controller you configure to show a rendered graphics buffer, in a particular location and with particular transforms. You commonly have one plane that just covers the whole screen, and then sometimes put dynamic content on top in other planes so you can avoid having to redraw the main buffer when smaller bits of it change, like a video player or cursor. You could also e.g., scroll by rendering an entire document in advance and then move the plane around to reveal parts of it.)

reply
> Desktop GPU designs did not focus on tiny efficiency gains

I'm not sure they're all that tiny if you can squeeze out 70% of top end performance for 25% of the power draw :)

reply
I think the implication is they focussed on the huge efficiency gains, and didn't focus on the small ones?
reply
This is plausible to me as well. A couple years ago we were trying to make dynamic memory allocation in Vello more robust and explored using async readback of a status buffer. In that case, the async task doesn't wake until the command buffer completes and signals a fence back to the CPU.

Long story short, performance was disappointing and we abandoned the approach. It's easy to believe it's a real problem especially when there are other factors including GPU being clocked down to save power.

Same caveat as parent, I have no direct knowledge of MacBook Neo or this specific issue.

reply
This was a really informative and interesting reply articulated in simple enough terms that I am now interested in GPUs, thanks
reply
But wouldn't the software cursor operations also go in the queue? I don't see the problem.
reply
Modern GPUs usually have multiple command queues, at least one for application use (often separate queues for rendering and compute) and one for OS use. There's a good chance that this wasn't implemented on a chip intended for a phone.
reply
For something as small as a cursor they could be doing direct framebuffer writes.
reply
> A hardware cursor is basically a small bitmap that can be positioned anywhere on the screen and moved around by simply updating position registers (which is normally done per mouse interrupt); the hardware draws it automatically

Do modern machines still have custom hardware for cursors? That would surprise me, as a GPU can easily blit a small cursor on top of whatever gets drawn.

reply
The cursor plane is the way to "easily blit a small cursor on top of whatever gets drawn". If the cursor was drawn on the primary plane, the primary plane would need to be redrawn (expensive!) or repaired (messy!) to change the cursor position.
reply
how do hardware cursors work in a composited desktop?

the cursor could just be another small rectangle texture you position on top of the other surfaces. there is no need to read the framebuffer/write into it, its just a z-stack of 3d surfaces now

reply
AFAIK, they work just like in a non-composited desktop.

The problem with rendering the cursor into the primary plane is that, often, only the cursor changes, and you'd have to re-render the whole plane that contains the cursor. That is easily doable for modern hardware, but bad for power consumption and may also be higher latency. (The latency aspect gets interesting when dragging something on the primary plane - I think most compositors temporarily disable the hardware cursor in order to keep cursor and dragged object in sync.)

reply
>A software cursor is manually drawn by the graphics stack, which saves what was under it, draws the cursor, and then whenever it needs to be moved, writes the original data back, saves the data at the new position, and then draws the cursor there.

AFAIK this hasn't been true for a long time on most platforms, certainly on macOS. The desktop image is composited on the GPU by assembling the underlying windows with appropriate effects like shadows and scrolling/scaling. A software cursor is just another overlay which may also have a transparent shadow.

Actually preserving what was under the cursor and putting it back is the sort of thing you wouldn't do anymore, because that's a cache which requires babysitting based on everything that's underneath and around it.

e.g. On macOS there's full screen zooming for accessibility, and if you wiggle the mouse, the cursor grows in size briefly (maybe even too big for hardware cursor to support).

reply
>A software cursor is manually drawn by the graphics stack, which saves what was under it, draws the cursor, and then whenever it needs to be moved, writes the original data back, saves the data at the new position, and then draws the cursor there.

If a hardware layer is not being used the cursor layer will be treated like any other layer in the compositor. Modern compositors don't try and save and write pixels like that. It will just rerender it.

>(which is normally done per mouse interrupt);

It's normally done every frame the compositor makes.

>or it may end up drawing over what software wants to draw

The compositor composites everything at that will be shown on the next refresh of the display. Things don't indepently step on each others toes since it's just the compositor rendering and synchronizing all hardware layers (planes).

reply