What we learned while building this is that every token in the context matters. We spend a lot of time watching logs of agent sessions and changing the tool params, the errors returned by tools, the agent prompts, etc.

We noticed, for example, the importance of letting the model pull from the context instead of pushing lots of data into the prompt. We have "complex" error reporting because we have to differentiate between real non-retryable errors and errors that teach the model to retry differently. It changes the model's behavior completely.
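To make the retryable vs. non-retryable distinction concrete, here is a minimal sketch of what such tool error reporting could look like. Everything in it is an assumption for illustration (`ErrorKind`, `ToolError`, `run_query_tool`, `MAX_ROWS`, and the message wording are hypothetical names, not the actual implementation): the tool returns text that tells the model either to stop or to retry with different arguments.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    FATAL = "fatal"              # retrying cannot help; the model should stop
    CORRECTABLE = "correctable"  # the model should retry with different arguments


@dataclass
class ToolError:
    kind: ErrorKind
    message: str
    hint: str = ""

    def to_model_text(self) -> str:
        """Render the error as the text the model sees in its context."""
        if self.kind is ErrorKind.FATAL:
            return f"ERROR (do not retry): {self.message}"
        return f"ERROR (retry with changes): {self.message} Hint: {self.hint}"


MAX_ROWS = 1000  # hypothetical limit, chosen only for this example


def run_query_tool(sql: str, limit: Optional[int] = None) -> str:
    """Hypothetical tool entry point; it validates arguments and never runs SQL."""
    if not sql.strip().lower().startswith("select"):
        # Non-retryable: no change of arguments makes this call acceptable.
        return ToolError(
            ErrorKind.FATAL, "Only SELECT statements are allowed."
        ).to_model_text()
    if limit is None or limit > MAX_ROWS:
        # Retryable: the hint teaches the model how to call the tool differently.
        return ToolError(
            ErrorKind.CORRECTABLE,
            "Result set would be too large.",
            hint=f"Pass limit <= {MAX_ROWS} or aggregate in the query.",
        ).to_model_text()
    return f"OK: would run {sql!r} with limit={limit}"
```

The design choice in this sketch is that the hint stays short and is only produced when the model actually hits the error, so the model pulls that guidance on demand instead of carrying it in every prompt.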

Also, I agree with "significant weight of human input and judgement": we spent lots of time optimizing the index and thinking about how to organize data so queries perform at scale. Claude wasn't very helpful there.

by whoami4041 7 hours ago

Very interesting work here, no doubt. It's a measured approach to using an LLM with SQL rather than trying to make it responsible for everything end-to-end.

by SignalStackDev 6 hours ago

[dead]

by blharr 7 hours ago

"LLMs are good at [task I'm not good enough at to tell the LLM is bad at]" is becoming common

by dylan604 8 hours ago

> IMO analytics/SQL will always be a space that needs a significant weight of human input and judgement in generating.

Isn't that precisely what is done when prompting?

by whoami4041 7 hours ago

The key to my point is in the word "generating", meaning human input/judgement by actually typing more SQL than the LLM produces. The model's reasoning and code generation pipelines are typically 2 separate code paths, so it may not always do what it intends, which can lead to unexpected results.

by aluzzardi 6 hours ago

> My experience with LLM generated SQL in OLTP and OLAP platforms has been a mixed bag

Models are evolving fast. If your experience is older than a few months, I encourage you to try again.

I mean this with the best intentions: it's seriously mind boggling. We started doing this with Sonnet 4.0 and the relevance was okay at best. Then in September we shifted to Sonnet 4.5 and it's been night and day.

Every single model released since then (Opus 4.5, 4.6) has meaningfully improved the quality of results.

by whoami4041 6 hours ago

I totally agree. However, none of them are infallible and never will be. They're nondeterministic by nature. There is an interesting psychological nuance that I've noticed even in myself that comes with AI assistance in coding, and that's review/approval fatigue. The model could be chugging along happily for hours and make a sudden, terrible error in the 10th hour, after you've been staring at reasoning and logs endlessly. The risk of missing that error is very high at the tail end of the session. The point I was making (poorly) is that in this specific domain, where businesses are making data-driven decisions on output and insights that can determine the trajectory of the entire organization, human involvement is more critical than when, say, writing something like a Python function with an LLM.

by shad42 5 hours ago

I agree. In the Mendral agent we automated what is time-consuming for humans (like debugging a flaky test), but it needs permission to confirm the remediation and open a PR.

But it's night and day to fix your CI when someone (in this case an agent) has already dug into the logs and the test code and proposed options for a fix. We have several customers asking us to automate the rest (all the way to merging code), but we haven't done it for the reasons you mention. Although I am sure we'll get there sometime this year.
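A rough sketch of that kind of approval gate, under stated assumptions: `Remediation`, `propose_and_gate`, and the `approve` / `open_pr` callbacks are hypothetical names for illustration, not the Mendral agent's API. The agent does the investigation and proposes options, and nothing is opened until a human approves one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Remediation:
    summary: str  # short description of the proposed fix
    diff: str     # the change the agent wants to make


def propose_and_gate(
    options: List[Remediation],
    approve: Callable[[Remediation], bool],
    open_pr: Callable[[Remediation], str],
) -> Optional[str]:
    """Open a PR for the first option a human approves; otherwise do nothing."""
    for option in options:
        if approve(option):         # the human decision point
            return open_pr(option)  # the side effect happens only after approval
    return None                     # nothing approved, so nothing is opened
```

In this sketch, automating "the rest" would amount to swapping `approve` for a policy check, which is exactly the step being held back here.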

by whoami4041 4 hours ago

Shameless plug here for Lexega—a deterministic policy enforcement layer for SQL in CI/CD :) https://lexega.com

There are bridges here that the industry has yet to figure out. There is absolutely a place for LLMs in these workflows, and what you've done here with the Mendral agent is very disciplined, which is, I'd venture to say, uncommon. Leadership wants results, which presses teams to ship things that maybe shouldn't be shipped quite yet. IMO the industry is moving faster than it can keep up with the implications.