Arguably DRAM-based GPUs/TPUs are quite inefficient for inference compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but they still lose to different architectures that are better suited for inference.
The way modern Nvidia GPUs perform inference is that they have a processor (tensor memory accelerator) that directly performs tensor memory operations which directly concedes that GPGPU as a paradigm is too inefficient for matrix multiplication.