Chip speed has increased faster than memory speed for a long time now, leaving DRAM behind. GDDR was good for a while but is no longer sufficient; HBM is what's used now.
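To put rough numbers on the compute-vs-bandwidth gap (the figures below are assumed/rounded ballparks for a modern HBM accelerator, not from any datasheet I'd vouch for), a quick roofline calculation shows how many FLOPs you need per byte moved before memory stops being the bottleneck:

```python
# Back-of-envelope roofline. Peak compute and HBM bandwidth are
# assumed round numbers for a current HBM-class accelerator.
peak_flops = 989e12   # ~1 PFLOP/s dense bf16 (assumed)
hbm_bw     = 3.35e12  # ~3.35 TB/s HBM bandwidth (assumed)

# Arithmetic intensity (FLOPs per byte) at the roofline crossover:
crossover = peak_flops / hbm_bw
print(f"need ~{crossover:.0f} FLOPs per byte moved to stay compute-bound")

# LLM decode is roughly a matrix-vector product: ~2 FLOPs per bf16
# weight (2 bytes), i.e. ~1 FLOP/byte -- far below the crossover,
# hence bandwidth-bound no matter how fast the chip is.
decode_intensity = 2 / 2
print(f"decode intensity ~{decode_intensity} FLOP/byte -> memory-bound")
```

With anything like those numbers, you need hundreds of FLOPs per byte to keep the ALUs fed, which is why bandwidth keeps getting pushed closer to the compute.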

The logical endpoint of this process would be figuring out how to mix CPU logic transistors and DRAM capacitors on the same die, as opposed to merely stacking separate chips in the same package.

A related stopgap is the AI startup (forget which) building accelerators on giant chips full of SRAM. Not a cost-effective approach outside of ML.

reply
Cerebras?
reply
For larger contexts, the bottleneck is probably token prefill (which is compute-bound) rather than memory bandwidth. Supposedly prefill is faster on the M5's GPU, but it's still a big hurdle on pre-M5 chips.
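Rough arithmetic on why prefill dominates at long context (all numbers below are illustrative assumptions for a 7B bf16 model on a laptop-class GPU, not measurements of any real chip):

```python
# Prefill is compute-bound (all prompt tokens batched through the
# weights); decode is bandwidth-bound (weights re-read per token).
# Every figure here is an assumed ballpark.
params  = 7e9                  # assumed 7B-parameter model
bytes_w = params * 2           # bf16 weights
tokens  = 8000                 # assumed prompt length
flops   = 2 * params * tokens  # ~2 FLOPs per param per token

compute = 30e12                # assumed usable FLOP/s
bw      = 400e9                # assumed memory bandwidth, bytes/s

prefill_s        = flops / compute   # one big compute-bound pass
decode_s_per_tok = bytes_w / bw      # weight reread per decoded token
print(f"prefill ~{prefill_s:.1f}s, decode ~{1/decode_s_per_tok:.0f} tok/s")
```

Under those assumptions you'd wait several seconds before the first token even on a healthy decode rate, and prefill time grows with prompt length while per-token decode time doesn't, so long contexts hurt exactly where this comment says.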
reply