In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.
What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
If you're going to have to fact check everything anyways...why bother using them in the first place?
"If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.