*32MB of RAM (plus 4MB of video RAM and a little sound and IOP memory).
reply
> I don't have 30k bucks to spare on a gpu :(

Do you have $2/hr to rent an RTX 6000 96GB, or $5/hr for a B200 180GB, in the cloud?

reply
I thought about that, but idk if they'd let me modify the Linux kernel and the NVIDIA CUDA kernel at all.
reply
In those systems you could probably leverage something like NVIDIA SCADA or GPUDirect Storage (GDS) directly.
reply
Actually, since they have direct GDS, it should perform really well on professional GPUs.
reply
I think you can do a bunch of that on DigitalOcean's GPU droplets.
reply
I'd rather not give money to scalper barons if I can avoid it. Fab capacity is going to rental hardware rather than hardware for humans.
reply
3000 tokens per second on 32 MB of RAM?
reply
fast != practical

You can get lots of tokens per second on the CPU if the entire network fits in L1 cache. Unfortunately, the sub-64 KiB model segment isn't looking so hot.
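
To put a rough number on that, here's a back-of-the-envelope sketch of how many weights fit in 64 KiB at a few common precisions. The 64 KiB budget is just the figure from above, the precision table is my own assumption, and it ignores activations, the KV cache, and code, so the real ceiling is lower.

```python
# Rough upper bound on parameter count for a model that lives entirely in L1.
# Assumption: the whole 64 KiB is available for weights (it never is in practice).
L1_BYTES = 64 * 1024

# Bytes per weight at each precision (illustrative choices, not from the thread).
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def max_params(cache_bytes: int, bytes_per_weight: float) -> int:
    """Largest number of weights that fits in cache_bytes."""
    return int(cache_bytes // bytes_per_weight)


for name, size in BYTES_PER_WEIGHT.items():
    print(f"{name}: {max_params(L1_BYTES, size):,} parameters")
```

Even at 4 bits per weight that is on the order of a hundred thousand parameters, which is why that segment isn't looking so hot.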

But actually ... 3000? Did GP misplace one or two zeros there?

reply
I wondered the same, but the rendering seems right: the output was almost instant. I'll recheck the token counter. Anyway, as you say, fast isn't practical. I actually had to develop my own tiny model, https://huggingface.co/xaskasdf/brandon-tiny-10m-instruct, to fit something "usable", and it's basically a liar or disinformation machine haha
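
For a sense of scale, a quick sketch of the weight footprint against a 32 MB budget. It assumes the "10m" in the model name means roughly 10 million parameters, and the precisions are illustrative; it counts weights only, with no activations, KV cache, or runtime.

```python
# Weights-only memory estimate for a small model against a fixed RAM budget.
# Assumptions: ~10M parameters (inferred from the model name), 32 MiB of RAM.
N_PARAMS = 10_000_000
RAM_BYTES = 32 * 1024 * 1024

# Bytes per weight at each precision (illustrative, not the model's actual format).
PRECISIONS = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(n_params: int, bytes_per_weight: int) -> int:
    """Bytes needed to store the weights alone."""
    return n_params * bytes_per_weight


for name, size in PRECISIONS.items():
    needed = weight_bytes(N_PARAMS, size)
    verdict = "fits" if needed <= RAM_BYTES else "does not fit"
    print(f"{name}: {needed / 2**20:.1f} MiB, {verdict} in 32 MiB")
```

Under those assumptions fp32 is already over budget, and fp16 or int8 leaves some room for everything else.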