Any idea how they ensure this doesn't happen? As in, how can a user verify that the model didn't touch any of the numbers and only built pipelines for them?
What I've been telling my CFO, who wants to get AI involved in things, is that for a lot of accounting and finance work "Trust but verify" doesn't work, because verifying is often the same process as doing the work.
Build a deterministic query set and automate it for monthly or daily reporting reconciliation.
Leave AI out of it.
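A minimal sketch of the idea, with hypothetical table names (ap_invoice_lines for the extracted invoice lines, gl_postings for what actually hit the ledger):

```python
import psycopg2

# One fixed query, run every close, no model in the loop. Only rows with a
# nonzero variance come back, and those go to a human.
TARIFF_TIEOUT = """
    SELECT s.gl_code,
           s.total                        AS subledger_total,
           COALESCE(g.total, 0)           AS gl_total,
           s.total - COALESCE(g.total, 0) AS variance
    FROM (SELECT gl_code, SUM(amount) AS total
          FROM ap_invoice_lines
          WHERE charge_type = 'tariff'
          GROUP BY gl_code) s
    LEFT JOIN (SELECT gl_code, SUM(amount) AS total
               FROM gl_postings
               WHERE source = 'AP'
               GROUP BY gl_code) g USING (gl_code)
    WHERE s.total <> COALESCE(g.total, 0)
"""

def run_recon(dsn: str):
    # The query is deterministic, so it only has to be reviewed once,
    # not re-verified on every run.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TARIFF_TIEOUT)
        return cur.fetchall()
```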
How do you verify that all the tariffs are properly allocated to the correct GL code without going through the invoices and checking each tariff on the list? How do you make sure none were accidentally assigned to other GL codes? All you have is PDFs; you don't know what the AI did or didn't do with the info on them, and there aren't many ways to catch its errors without doing the work yourself.
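The closest you get to an automated check is re-deriving the allocation deterministically and diffing it against the AI's output, something like this sketch (the rules table and field names are made up):

```python
# Hypothetical deterministic rules: HTS code prefix -> expected GL code.
RULES = {"9903": "2150-TARIFF", "9902": "2150-TARIFF"}  # illustrative only

def flag_mismatches(lines):
    """lines: iterable of dicts with hts_code and ai_gl_code keys."""
    for line in lines:
        expected = RULES.get(line["hts_code"][:4])
        if expected and line["ai_gl_code"] != expected:
            yield line  # a human still has to resolve every one of these

# But writing and maintaining RULES *is* the allocation work, which is
# the point: the verification and the work are the same process.
```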
If anything, it's going to add a step to these "kids'" work: they have to use the AI to do the work, then redo 90% of it just to verify the output, and then the AI gets the credit anyway.
Or the overworked people are going to use AI and not verify it, which means not catching any errors or hallucinations; apparently that's fine, because someone claims it's a solved problem for the black box of infinite possibility and inconsistent output.
When management signs off on work (SOX requires CEOs and CFOs to personally certify the accuracy of financial reports), they do not personally 'verify that all the tariffs are properly allocated to the correct GL code' or nearly any other hard number. The world runs on human-level best effort and management of the residual risk. I'm sure additional checks will be developed to categorize that risk, but the entire field of finance is about analyzing and pricing risk, so I think it'll work just fine.
For anything math, it’s much more reliable to give agents tools. So if you want to verify that your real estate offer is in the 90–95th percentile of offerings in the past three months, don’t give Claude that data and ask it to calculate. Offload to a tool that can query Postgres.
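A minimal sketch of that kind of tool, assuming a Postgres table offers with price and listed_at columns (both names are assumptions):

```python
import psycopg2

def offer_percentile(dsn: str, offer_price: float):
    """Percentile of offer_price among offers listed in the past 3 months."""
    sql = """
        SELECT 100.0 * COUNT(*) FILTER (WHERE price <= %s)
               / NULLIF(COUNT(*), 0)
        FROM offers
        WHERE listed_at >= now() - interval '3 months'
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, (offer_price,))
        (pct,) = cur.fetchone()
        return pct  # e.g. 93.5 means the offer beats 93.5% of recent offers

# The agent only decides *to call* offer_percentile and interprets the
# result; the arithmetic itself is deterministic SQL.
```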
It's similar for anything needing data from an external source of truth. For example, what payers (insurance companies) reimburse for a specific CPT code (medical procedure) can change at any time and may differ between today and when the service was provided two months ago. Have a tool that farms out the calculation, which itself uses a database or whatever to pull the rate data.
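Same pattern, sketched with a hypothetical fee_schedule table; the important part is keying the lookup to the date of service rather than today:

```python
import datetime
import psycopg2

def reimbursement_rate(dsn: str, payer_id: str, cpt_code: str,
                       service_date: datetime.date):
    """Rate in effect on the date of service, not today's rate."""
    sql = """
        SELECT rate
        FROM fee_schedule
        WHERE payer_id = %s AND cpt_code = %s
          AND effective_from <= %s
          AND (effective_to IS NULL OR effective_to >= %s)
        ORDER BY effective_from DESC
        LIMIT 1
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, (payer_id, cpt_code, service_date, service_date))
        row = cur.fetchone()
        return row[0] if row else None  # no rate on file for that date
```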
The LLM can orchestrate and figure out what needs to be done, like a human would, but anything else is either scary (math) or expensive (burning context to constantly pull documentation).
I feel like there’s a metaphor in there... maybe I’ll ask Claude about it.
Everyone wants in on my daily auto-generated Excel reports; nobody ever opens them. Just being on the list makes you someone.
My money's on that.
I’ve also had some great results with a /reflect skill that asks the agent to look at the work in the broader context of the project. But those are the only two skills I use regularly that aren’t specific to our company, codebase, or tools.
The AI is an expert in both following and generating prompts.
I think LLMs are trained on the millions of vibe-written LLM blog posts that are more superstition than fact. There is a lot of snake oil out there that gets treated as fact. If someone claims an LLM is better than humans at something, I always want to see the rigorous evaluations that quantify it, not "but they're trained on everything!"