So one-shotting a game of Snake should be great (tons of training data, errors are easily caught because it's a small program). Similar with building a lot of web UI front end, or one-shotting a personal project. On the other hand, I haven't been convinced that it's good enough to maintain large codebases or assist with niche topics that are not very well documented.
This became evident to me the moment I tried to have these models work on some PowerShell tasks for me. Even Opus today struggles with PowerShell.
Since anything in PS is probably some internal sysadmin tool, there's not much public code out there outside of Microsoft's documentation. Plus the Verb-Noun naming scheme makes it really easy to just hallucinate cmdlets (which it does, often). Its easier to have the LLM just do things in python using M365 Graph API than any of the provided PowerShell cmdlets.
OTOH, I've been using Claude for a lot of Swift & Swift UI work lately and it has no problems there, and I'd imagine there's even less publicly available training data for that so to be honest I'm not entirely sure why it fails so badly at powershell.
I use it to wrap ping.exe with colors and fewer columns, for example. yt-dlp wrapper to fetch 480p bestaudio with English subtitles, no playlist, works on a surprising number of video sites.
It does make cmdlets up, you're right, there.
Same is true of humans. So far my experience is that addressing the issue with the help of AI is faster than not (ie comprehending the system and creating the documentation).
This feels a bit like whataboutism.
It also feels like people don't listen to each others.
For example, reading the previous comment, it feels like the thing that reduce the enthusiasm was that at first GenAI looks like it was "reading, understanding and using its own knowledge to answer the problem", but as soon as it is a ore niche or a more complex situation, GenAI looks like it "does not understand the code, just does the equivalent of a StackOverflow search and try to apply the solutions that it found there, and this is why it felt like it understood the code before".
It does not at all means that GenAI is not terribly useful. And even better than humans in some situations.
But it feels that answering "same with humans" is missing this point: that's the opposite, humans usually try to understand the code and are bad at covering a very large range of very well documented subjects. That's the "uncanny valley" they talk about: they assumed GenAI performance on a subject X is due to a "human-like" approach, and it feels very strange when this impression falls apart.
The comment you answer to says that their experience is that AI and the human brain are not analogous and that AI is good to store large amount of knowledge and repeat it (or extrapolate based on pattern on the large amount of knowledge), but bad at understanding the code as a human does. Which explains why a human is more efficient when reacting on a thing that don't have a lot of documentation (on which the AI built its knowledge).
Humans are bad at storing large amount of knowledge, and this is why we need supervisor for human.
AI are bad to understand new stuff, they need to be able to connect the new stuff with a lot of examples they have been trained on (it does not mean the stuff is "identical", but it means "connected"), and this is why we need supervisor for AI.
We need supervisors for both human and AI, but for different uncorrelated reason.
It's the famous "email broken, fix pls" but in the form of an LLM prompt.
It can be frustrating to observe people interacting with these things. But it was just as frustrating 20 years ago, so maybe it's just a constant.
I don't think this is just about intention and willingness, it's just simply hard.
Or... were you illustrating?