by giancarlostoro 5 hours ago

by BillStrong 4 hours ago

As long as you don't keep calling out to the CPU, that is.

Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.

There was a test that turned one of the sub-1B Qwen3+ models into a single kernel that runs as one GPU pass without stalling on the CPU. I believe it saw quite a bit of perf lift over vLLM, which shows this is still an issue.

It's been a month, so I don't remember more details than this.
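
For a rough sense of why the stalls matter, here is a back-of-the-envelope model in Python. It compares a decode step where the CPU launches every kernel separately against one where the whole step runs as a single GPU pass. All three numbers (kernel count, per-kernel compute time, per-launch overhead) are made-up assumptions for illustration, not measurements from the test mentioned above or from vLLM.

```python
def separate_launch_time_us(num_kernels, kernel_us, launch_overhead_us):
    """CPU launches each kernel on its own, so the overhead is paid every time."""
    return num_kernels * (kernel_us + launch_overhead_us)


def fused_pass_time_us(num_kernels, kernel_us, launch_overhead_us):
    """One fused GPU pass: the same compute, but the launch overhead is paid once."""
    return launch_overhead_us + num_kernels * kernel_us


# Hypothetical values, chosen only to show the shape of the effect.
NUM_KERNELS = 200          # assumed kernels per decode step for a small model
KERNEL_US = 5.0            # assumed compute time per kernel, in microseconds
LAUNCH_OVERHEAD_US = 5.0   # assumed CPU-side overhead per launch, in microseconds

separate = separate_launch_time_us(NUM_KERNELS, KERNEL_US, LAUNCH_OVERHEAD_US)
fused = fused_pass_time_us(NUM_KERNELS, KERNEL_US, LAUNCH_OVERHEAD_US)

print(f"separate launches: {separate:.0f} us per step")
print(f"single fused pass: {fused:.0f} us per step")
print(f"speedup:           {separate / fused:.2f}x")
```

The smaller the model, the shorter each kernel runs, so the fixed per-launch cost takes up a larger share of the step. That would be consistent with the gain showing up on a sub-1B model, though this toy model ignores overlap between launches and execution.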

by hashmap 2 hours ago

you can port anything python is doing into rust/c++ with a couple of prompts, including parity validation. when the barrier to migrating is that thin, you are losing money and time by even continuing to talk about it. python is miserably slow, so don't let it touch any part of your system. no snakes in the house.

by jmalicki 3 hours ago

PyTorch dataloaders are often horribly inefficient; a lot of stuff there can benefit from Rust/C++.