by xaskasdf 23 hours ago |

by derstander 22 hours ago |

*32MB of RAM (plus 4MB of video RAM and a little sound and IOP memory).

by eleventyseven 18 hours ago |

> I don't have 30k bucks to spare on a gpu :(

Do you have $2/hr to rent an RTX 6000 96GB, or $5/hr for a B200 180GB, in the cloud?

by xaskasdf 9 hours ago |

I thought about that, but idk if they'd allow me to modify the Linux kernel and the NVIDIA CUDA kernel at all.

by jonassm 6 hours ago |

On those systems you could probably use something like NVIDIA SCADA or GDS (GPUDirect Storage) directly.

by xaskasdf 4 hours ago |

Actually, since they have direct GDS, it should perform really well on professional GPUs.

by green-salt 7 hours ago |

I think you can do a bunch of that on DigitalOcean's GPU droplets.

by superkuh 18 hours ago |

I'd rather not give money to scalper barons if I can avoid it. Fab capacity is going to hardware for rental rather than hardware for humans.

by anoncow 20 hours ago |

3000 tokens per sec on 32 MB of RAM?

by fc417fc802 19 hours ago |

fast != practical

You can get lots of tokens per second on the CPU if the entire network fits in L1 cache. Unfortunately the sub-64 KiB model segment isn't looking so hot.

But actually ... 3000? Did GP misplace one or two zeros there?
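
For a sense of scale, here is a back-of-envelope sketch of how many weights fit in 64 KiB. It assumes the whole budget goes to weights (no activations, code, or other data), and the helper name `max_params` is made up for illustration.

```python
# Rough count of model weights that fit in a 64 KiB budget.
# Assumption: every byte of the budget holds weights and nothing else.
L1_BUDGET_BYTES = 64 * 1024  # the 64 KiB figure mentioned above


def max_params(budget_bytes: int, bits_per_weight: int) -> int:
    """Largest number of weights that fit in budget_bytes at the given precision."""
    return (budget_bytes * 8) // bits_per_weight


if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        count = max_params(L1_BUDGET_BYTES, bits)
        print(f"{bits:>2}-bit weights: {count:,} parameters")
```

Even at 4 bits per weight that is on the order of a hundred thousand parameters, which is why that model segment is so thin.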

by xaskasdf 8 hours ago |

I wondered the same, but the rendering seems right: the output was almost instant. I'll recheck the token counter. Anyway, as you say, fast isn't practical. I actually had to develop my own tiny model, https://huggingface.co/xaskasdf/brandon-tiny-10m-instruct, to fit something "usable", and it's basically a liar or disinformation machine haha
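
A rough sketch of the memory arithmetic, assuming about 10 million parameters (read off the "10m" in the model name, not a confirmed count) and 32 MB of main RAM, counting weights only. The names `weight_bytes` and `fits_in_ram` are illustrative.

```python
# Weight footprint of a ~10M-parameter model versus a 32 MB RAM budget.
# Assumptions: parameter count inferred from the "10m" in the model name;
# only weights are counted (no activations, KV cache, code, or OS overhead).
RAM_BYTES = 32 * 1024 * 1024
N_PARAMS = 10_000_000


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight // 8


def fits_in_ram(n_params: int, bits_per_weight: int, ram_bytes: int = RAM_BYTES) -> bool:
    """True if the weights alone fit inside ram_bytes."""
    return weight_bytes(n_params, bits_per_weight) <= ram_bytes


if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        size_mib = weight_bytes(N_PARAMS, bits) / (1024 * 1024)
        verdict = "fits" if fits_in_ram(N_PARAMS, bits) else "does not fit"
        print(f"{bits:>2}-bit: {size_mib:6.1f} MiB -> {verdict}")
```

Under those assumptions, 32-bit weights alone overshoot the budget, so something at 16 bits or below would be needed before anything else is loaded.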