Can always build a bigger hall
On the other hand, competition is good - nvidia can’t have the whole pie forever.
And that's the point - what's "reasonable" depends on the hardware and is far from fixed. Some users here are saying that this model is "blazing fast" but a bit weaker than expected, and one might've guessed as much.
> On the other hand, competition is good - nvidia can’t have the whole pie forever.
Sure, but arguably the closest thing to competition for Nvidia is TPUs and future custom ASICs, which will likely save a lot on energy per model inference while not focusing all that much on raw speed.
I disagree. Yes, it does matter, but because the popular interface is chat, streaming the results of inference feels better to the squishy, messy, gross human operating the chat, even if it ends up taking longer overall. You can give all the benchmark results you want; humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.
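To make the distinction concrete, here's a minimal sketch (Python, with made-up delays standing in for a real streaming endpoint; none of the numbers reflect any actual Cerebras or Nvidia deployment) that separates time to first token from total generation time:

    import time

    def stream_tokens(first_token_delay, per_token_delay, n_tokens):
        # Simulated streaming generator; the delays are hypothetical.
        time.sleep(first_token_delay)
        yield "token"
        for _ in range(n_tokens - 1):
            time.sleep(per_token_delay)
            yield "token"

    def measure(gen):
        start = time.monotonic()
        first = None
        for _ in gen:
            if first is None:
                first = time.monotonic() - start   # time to first token
        total = time.monotonic() - start           # total generation time
        return first, total

    # Backend A starts streaming almost immediately but takes longer overall;
    # backend B finishes sooner but makes you stare at a blank screen first.
    ttft_a, total_a = measure(stream_tokens(0.05, 0.02, 200))
    ttft_b, total_b = measure(stream_tokens(3.0, 0.001, 200))
    print(f"A: first token {ttft_a:.2f}s, total {total_a:.2f}s")
    print(f"B: first token {ttft_b:.2f}s, total {total_b:.2f}s")

A is "slower" by total wall clock, yet it's the one most people will describe as faster in a chat UI, which is the whole point about benchmarks versus feel.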
Compare the photos of a Cerebras deployment to a TPU deployment.
https://www.nextplatform.com/wp-content/uploads/2023/07/cere...
https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...
The difference is striking.
Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.
Training models needs everything in one DC; inference doesn't.