I was about to post your last point / quote. Going multi-GPU is relatively not so tough, but once you go multi-node you have a distributed storage/IO/compute system, which is highly non-trivial. Add the long training times, and now you have robustness and fault-tolerance concerns around hardware failures and restarts. Today’s training systems are engineering marvels.
Good point!

There is also a case that Markov chains could theoretically do these if tuned well. Or even a SAT problem.

"LLM is just fancy autocomplete"