Well, what is your definition of "super reliable in the output", and is it a quantifiable/measurable target or just a feeling?
Is it "more than humans", "more than senior developers", "almost perfect", "perfect"?
> It might behave differently than specified and a human is required to validate every output carefully or else.
Sure, just like meatbag developers. All the security flaws AI finds today were introduced years or decades ago by humans, and (as far as we know) went unnoticed by humans for ages.
Between ten thousand runs of:
```
const int MAX_COUNT = 10000;
printf("I'll count up to %d\n", MAX_COUNT);
for (int i = 1; i <= MAX_COUNT; i++)
    printf("I'm now counting %d\n", i);
```
And of the following prompt:
```
You'll count to 10,000. At the start say "I'll count up to 10,000" and then for each number say "I'm now counting <number>" and do not say anything else. Do not miss numbers in between.
```
Which one is going to produce 100% correct results across 10,000 runs of each?
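For what it's worth, "correct" here is trivially checkable by machine for either tool. A minimal sketch (hypothetical helper name, assuming each run's output is captured as one text transcript):

```python
def is_correct(transcript: str, max_count: int = 10_000) -> bool:
    """Check one run's output against the exact expected lines.

    Expected: a header line, then one line per number from 1 to
    max_count, nothing else and nothing missing.
    """
    lines = transcript.strip().splitlines()
    expected = [f"I'll count up to {max_count}"] + [
        f"I'm now counting {i}" for i in range(1, max_count + 1)
    ]
    return lines == expected
```

Run that over all 10,000 transcripts from each tool and compare the pass counts; for the compiled loop the result is determined by the code, for the prompt it's an empirical sample.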
Now don't give me "these are different tools". We all know. I'm talking about reliability and predictability.