You still need to do a forward pass per token. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle, but clearly they aren't doing that.
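To make the dependency argument concrete, here is a toy model, not a description of any real chip. It assumes a forward pass is a pipeline of `depth` stages at one cycle each, that the first stage accepts one token per cycle, and that a sequence cannot start its next token until the previous one has left the pipeline. The function name `tokens_per_cycle` and all of its parameters are made up for illustration.

```python
def tokens_per_cycle(depth, batch, tokens_per_seq):
    """Toy pipeline model: aggregate tokens per cycle across all sequences.

    Assumptions (illustrative only): `depth` one-cycle stages, one token
    may enter the pipeline per cycle, and each sequence is autoregressive,
    so its next token waits for the previous one to finish.
    """
    ready_at = [0] * batch               # earliest cycle each sequence may issue
    remaining = [tokens_per_seq] * batch # tokens each sequence still has to issue
    total = batch * tokens_per_seq
    issued = 0
    cycle = 0
    last_finish = 0
    while issued < total:
        # Issue at most one token this cycle, from the first ready sequence.
        for s in range(batch):
            if remaining[s] > 0 and ready_at[s] <= cycle:
                remaining[s] -= 1
                ready_at[s] = cycle + depth  # token leaves the pipeline here
                last_finish = cycle + depth
                issued += 1
                break
        cycle += 1
    return total / last_finish


# One sequence alone is stuck at 1/depth tokens per cycle.
print(tokens_per_cycle(depth=4, batch=1, tokens_per_seq=100))
# With batch >= depth the pipeline stays full and the aggregate rate approaches 1.
print(tokens_per_cycle(depth=4, batch=4, tokens_per_seq=100))
```

In this model a single sequence never gets past `1 / depth` tokens per cycle. Only the aggregate rate over at least `depth` independent sequences can approach one token per cycle, which is the sense in which batching "breaks" the dependency.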
More aggressive pipelining will probably be the next step.
Reading from and writing to memory alone takes much more than a clock cycle.