We gave terabytes of CI logs to an LLM

(www.mendral.com)

143 points by shad42 7 hours ago

by buryat 5 hours ago

I just wrote a tool for reducing logs for LLM analysis (https://github.com/ascii766164696D/log-mcp)

Many logs contain uninteresting information that easily pollutes the context. My approach uses a TF-IDF classifier plus a BERT model on GPU to classify log lines further and reduce the number of lines that are then fed to an LLM. The total size of the models is 50MB, and the classifier is written in Rust, which lets it classify >1M lines/sec. It also finds interesting cases that simple grepping can miss.

I trained it on ~90GB of logs and provide scripts to retrain the models (https://github.com/ascii766164696D/log-mcp/tree/main/scripts)

It's meant to be used with Claude Code CLI so it could use these tools instead of trying to read the log files
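
A minimal sketch of the filtering idea, in Python. This is not the log-mcp implementation (the comment describes a TF-IDF classifier plus a BERT model, with the classifier in Rust); it only scores each line by the average rarity (IDF) of its tokens and keeps the top k. The function names and the tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(line: str) -> list[str]:
    # Lowercase alphabetic tokens only; digits are dropped so that IDs and
    # timestamps do not make every line look unique.
    return re.findall(r"[a-z_]+", line.lower())


def rarity_scores(lines: list[str]) -> list[float]:
    # Document frequency: in how many lines does each token appear?
    df = Counter()
    for line in lines:
        df.update(set(tokenize(line)))
    n = len(lines)
    scores = []
    for line in lines:
        tokens = set(tokenize(line))
        if not tokens:
            scores.append(0.0)
            continue
        # Mean inverse document frequency of the tokens in this line.
        scores.append(sum(math.log(n / df[t]) for t in tokens) / len(tokens))
    return scores


def top_k_lines(lines: list[str], k: int) -> list[str]:
    scores = rarity_scores(lines)
    ranked = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    # Return the k rarest lines in their original order.
    return [lines[i] for i in sorted(ranked[:k])]
```

A line that appears once among hundreds of near-identical lines gets a high score regardless of its log level, which is the property the comment is after. A real classifier would need labeled data to learn what "interesting" means.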

by aluzzardi 4 hours ago

Mendral co-founder here and author of the post.

This is an interesting approach. I definitely agree with the problem statement: if the LLM has to filter by error/fatal because of context window constraints, it will miss crucial information.

We took a different approach: we have a main agent (Opus 4.6) dispatching "log research" jobs to sub agents (Haiku 4.5, which is fast and cheap). The sub agent reads a whole bunch of logs and returns only the relevant parts to the parent agent.

This is exactly how coding agents (e.g. Claude Code) do it as well. Except instead of having sub agents use grep/read/tail, they use plain SQL.

by buryat 4 hours ago

yeah, I saw Claude Code doing lots of grepping/find and was curious if that approach might miss something in the log lines, or if loading a small portion of interesting log lines into the context could help. I frequently find that just looking at ERROR/WARN lines is not enough, since some might not actually be errors and some other skipped log lines might have something worth looking into.

And I just wanted to try MCP tooling tbh, hehe. It took me 2 days to create this.

by aluzzardi 4 hours ago

From our experience running this, we're seeing patterns like these:

- Opus agent wakes up when we detect an incident (e.g. CI broke on main)

- It looks at the big picture (e.g. which job broke) and makes a plan to investigate

- It dispatches narrowly focused tasks to Haiku sub agents (e.g. "extract the failing log patterns from commit XXX on job YYY ...")

- Sub agents use the equivalent of "tail", "grep", etc (using SQL) on a very narrow sub-set of logs (as directed by Opus) and return only relevant data (so they can interpret INFO logs as actually being the problem)

- Parent Opus agent correlates between sub agents. Can decide to spawn more sub agents to continue the investigation

It's no different than what I would do as a human, really. If there are terabytes of logs, I'm not going to read all of them: I'll make a plan, open a bunch of tabs and surface interesting bits.
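
The sub-agent step above can be sketched roughly as follows. This is an assumption-heavy illustration, not Mendral's code: SQLite stands in for the real log store, the table `ci_logs` and its columns are hypothetical, and the two helpers only show what SQL equivalents of "tail" and "grep", scoped to one commit and one job, might look like.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Tiny in-memory stand-in for a log store (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ci_logs (commit_sha TEXT, job TEXT, ts INTEGER, level TEXT, message TEXT)"
    )
    rows = [
        ("abc123", "unit-tests", 1, "INFO", "starting test suite"),
        ("abc123", "unit-tests", 2, "INFO", "cache miss for dependency bundle, rebuilding"),
        ("abc123", "unit-tests", 3, "ERROR", "test_checkout timed out after 30s"),
        ("abc123", "lint", 1, "INFO", "lint passed"),
        ("def456", "unit-tests", 1, "INFO", "starting test suite"),
    ]
    conn.executemany("INSERT INTO ci_logs VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def tail_logs(conn, commit_sha, job, n=50):
    # "tail -n": the last n lines of one job on one commit, oldest first.
    cur = conn.execute(
        "SELECT ts, level, message FROM ("
        "  SELECT ts, level, message FROM ci_logs"
        "  WHERE commit_sha = ? AND job = ? ORDER BY ts DESC LIMIT ?"
        ") ORDER BY ts ASC",
        (commit_sha, job, n),
    )
    return cur.fetchall()


def grep_logs(conn, commit_sha, job, needle, limit=50):
    # "grep": substring match on the message, at any log level.
    # Note: % and _ inside needle act as LIKE wildcards in this sketch.
    cur = conn.execute(
        "SELECT ts, level, message FROM ci_logs"
        " WHERE commit_sha = ? AND job = ? AND message LIKE '%' || ? || '%'"
        " ORDER BY ts LIMIT ?",
        (commit_sha, job, needle, limit),
    )
    return cur.fetchall()
```

Because neither helper filters on `level`, a sub agent can still surface an INFO line (here the cache miss) as the likely cause, which matches the point made in the list above.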

by prescriptivist 3 hours ago

I have an agent system analyzing time series data periodically. What I've landed on is that the tools themselves pre-process the time series data, giving it more semantic meaning: converting timestamps to human-readable dates, and additionally preprocessing it with statistical analysis, such as calculating the current window's min/mean/max value for the series as well as the same for a trailing window, and surfacing those in the data. Also adding a volatility score, and doing things like collapsing runs of similar series that aren't particularly interesting from a volatility perspective and just trying to highlight anomalous series in the window in various ways.

This isn't anything new. It's not particularly technical or novel in any way, but it seems to work pretty well for identifying anomalies and comparing series over time horizons. It's even less token efficient on small windows than piping in a bunch of JSON, but it seems to be more effective from an analysis point of view.

The strange thing about it is that it involves fairly deterministic analysis before we even send the data to the LLM, so one might ask, what's the point if you're already doing analysis? The answer is that LLMs can actually find interesting patterns across a lot of well presented data, and they can pick up on patterns in a way that feels like they are cross-referencing many different time series and correlating signals in interesting ways. That's where the general purpose LLMs are helpful in my experience.

Breaking out analysis into sub-agents is a logical next step, we just haven't gotten there yet.

And yeah the goal is to approximate those of us engineers who are good at RCAs in the moment, who have instincts about the system and can juggle a bunch of tabs and cross reference the signals in them.
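
A minimal sketch of the kind of pre-processing described in this comment, assuming a series is a time-sorted, non-empty list of `(unix_seconds, value)` pairs. The field names, the choice of coefficient of variation as the "volatility score", and the 0.1 threshold are all assumptions for illustration; the comment does not say how the real system computes them.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev


def _stats(values):
    # min/mean/max summary, or None for an empty window.
    if not values:
        return None
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def summarize_series(name, points, window):
    """Summarize one series for an LLM prompt.

    points: list of (unix_seconds, value), sorted by time, non-empty.
    window: number of most recent points treated as the current window (>= 1).
    """
    current = points[-window:]
    trailing = points[-2 * window:-window]
    cur_values = [v for _, v in current]
    cur_mean = mean(cur_values)
    # Volatility here is the coefficient of variation of the current window.
    volatility = pstdev(cur_values) / abs(cur_mean) if cur_mean else 0.0
    return {
        "series": name,
        # Human-readable UTC timestamp instead of a raw epoch value.
        "window_start": datetime.fromtimestamp(current[0][0], tz=timezone.utc).isoformat(),
        "current": _stats(cur_values),
        "trailing": _stats([v for _, v in trailing]),
        "volatility": round(volatility, 3),
    }


def collapse_quiet(summaries, threshold=0.1):
    # Keep only volatile series; replace the rest with a one-line note.
    loud = [s for s in summaries if s["volatility"] >= threshold]
    quiet = len(summaries) - len(loud)
    return loud, f"{quiet} series below volatility {threshold} omitted"
```

The output is more verbose than the raw numbers for a short window, which is the token-efficiency trade-off mentioned above.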

by azinman2 3 hours ago

So how can this be a company when it’s just what Claude Code already does?

by almosthere 1 hour ago

You may want to also have your agents write small scripts that auto flag future logs.

Have an array of scripts to run against each log (just rust code probably for speed) and have them flag for performance, errors, intrusions, etc...
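
A small sketch of what such an array of flagging rules could look like. It is written in Python rather than Rust only to keep it short; the rule names and regular expressions are made-up examples, not rules from any real system.

```python
import re

# Each rule is (category, compiled pattern). These patterns are illustrative.
RULES = [
    ("error", re.compile(r"\b(ERROR|FATAL|panic)\b")),
    ("performance", re.compile(r"timed out|slow query", re.IGNORECASE)),
    ("intrusion", re.compile(r"Failed password for")),
]


def flag_line(line: str) -> list[str]:
    # Return every category whose pattern matches the line, in rule order.
    return [name for name, pattern in RULES if pattern.search(line)]


def flag_log(lines):
    # Yield (line_number, categories) for lines that match at least one rule.
    for number, line in enumerate(lines, start=1):
        categories = flag_line(line)
        if categories:
            yield number, categories
```

An agent could append new `(category, pattern)` pairs to the list after each investigation, so that a known failure is caught cheaply the next time without involving an LLM.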

by ManuelKiessling 4 hours ago

https://github.com/dx-tooling/platform-problem-monitoring-co... could have a useful approach, too: it finds patterns in log lines and gives you a summary along the lines of "these 500 lines are all technically different, but they are all saying the same thing".
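
A minimal sketch of the general technique (collapsing lines into templates by masking their variable parts). This is not taken from the linked project; the masks and placeholder names are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Order matters: long hex-looking tokens are masked before plain numbers.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def template_of(line: str) -> str:
    # Replace variable-looking parts so similar lines share one template.
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


def collapse(lines):
    # Return (template, count) pairs, most frequent first.
    return Counter(template_of(line) for line in lines).most_common()
```

Comparing the `(template, count)` lists of a passing run and a failing run is one cheap way to spot templates that are new or unusually frequent.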

by buryat 4 hours ago

the pattern matcher is interesting for also collapsing log lines and comparing them between runs, thank you!

In my tool I was working from the premise that it's often difficult to even say what you're looking for, so I wanted a step after reading the logs that says what should actually be analyzed further, which naturally requires some model.

by jcgrillo 4 hours ago

Do you think it could do anything interesting with a highly compressed representation? CLP can apparently achieve a 169x compression ratio:

https://github.com/y-scope/clp

https://www.uber.com/blog/reducing-logging-cost-by-two-order...

by buryat 4 hours ago

interesting approach, thanks for directing me!

Since the classifier would need to have access to the whole log message I was looking into how search is organized for the CLP compression and see that:

> First, recall that CLP-compressed logs are searchable–a user query will first be directed to dictionary searches, and only matching log messages will be decompressed.

so then yeah, it can be combined with a classifier as they get decompressed, to get a filtered view of only the log lines that should be interesting.

The toughest part is still figuring out what "interesting" actually means in this context, and without domain knowledge of the logs it would be difficult to capture everything. But I think it's still better than going through all the logs post searching.

by jcgrillo 3 hours ago

I like the idea of SQL as the "common tongue" because provided the query is reasonably terse it's easy for the human to verify and reason about, there's shitloads of it in the LLM's training set, and (usually) the database doesn't lie. So you've mitigated some major LLM drawbacks that way.

Another thing SQL has in its favor is the ability, with tools like Trino or DataFusion, to basically turn "everything" into a table.

EDIT: thinking on it some more, though, at what point do you just know off the top of your head the small handful of SQL queries you regularly use and just skip the expensive LLM step altogether? Like... that's the thing that underwhelms me about all the "natural language query" excitement. We already have a very good, natural language for queries: SQL.

by chickensong 2 hours ago

> small handful of SQL queries you regularly use

Give those queries to the LLM and enjoy your sleep while the agent works.

by jcgrillo 34 minutes ago

hell yeah, give it the ssh keys too and sleep all the time

by sollewitt 6 hours ago

But does it work? I’ve used LLMs for log analysis and they have been prone to hallucinate reasons: depending on the logs the distance between cause and effects can be larger than context, usually we’re dealing with multiple failures at once for things to go badly wrong, and plenty of benign issues throw scary sounding errors.

by aluzzardi 6 hours ago

Post author here.

Yes, it works really well.

1) The latest models are radically better at this. We noticed a massive improvement in quality starting with Sonnet 4.5

2) The context issue is real. We solve this by using sub agents that read through logs and return only relevant bits to the parent agent’s context

by hinkley 4 hours ago

So you’re not getting alerts at 2 am from hallucinations?

by sollewitt 6 hours ago

I would be more interested in reading about this kind of orchestration and filtering than about data acquisition, if you have the energy for another post :)

by shad42 6 hours ago

We started writing very recently: https://www.mendral.com/blog - there is another post we made yesterday about the overall architecture. And we have a long list of things we're planning to write about in more detail.

Taking good note of your comment :)

by aluzzardi 4 hours ago

We've actually started to gather metrics this week to write that exact post :) Coming soon!

by huflungdung 5 hours ago

[dead]

by cgfjtynzdrfht 3 hours ago

[dead]

by shad42 6 hours ago

Mendral co-founder here. We built this infra to have our agent detect CI issues like flaky tests and fix them. Observing logs is useful for detecting anomalies, but we also use them to confirm a fix after the agent opens a PR (we have long coding sessions that verify a fix and re-run the CI if needed, all in the same agent loop).

So yes it works, we have customers in production.

by verdverm 6 hours ago

It can. Like all the other tasks, it's not magic: you need to make the job of the agent easier by giving it good instructions, tools, and environments. It's exactly the same thing that makes the life of humans easier too.

This post is a case study that shows one way to do this for a specific task. We found an RCA to a long-standing problem with our dev boxes this week using AI. I fed Gemini Deep Research a few logs and our tech stack, and it came back with an explanation of the underlying interactions, debugging commands, and the most likely fix. It was spot on. GDR is one of the best debugging tools for problems where you don't have full understanding.

If you are curious, and perhaps a PSA, the issue was that Docker and Tailscale were competing on IP table updates, and in rare circumstances (one dev, once every few weeks), Docker DNS would get borked. The fix is to ignore Docker managed interfaces in NetworkManager so Tailscale stops trying to do things with them.

by sollewitt 6 hours ago

Thanks - that’s the maddening thing with flakes - is it the thing under test or the thing doing the testing? Hermeticity is a lie we tell ourselves :)

by kburman 5 hours ago

Honestly, with recent models, these types of tasks are very much possible. Now it mostly depends on whether you are using the model correctly or not.

by PaulHoule 4 hours ago

My first take is that you could have 10 TB of logs with just a few unique lines that are actually interesting. So I am not thinking "Wow, what impressive big data you have there" but rather "if you have an accuracy of 1-10^-6 you are still overwhelmed with false positives" or "I hope your daddy is paying for your tokens"
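
To put rough numbers on that point: assuming an average log line of about 100 bytes (an assumption; neither the post nor the comment gives a figure), 10 TB is on the order of 10^11 lines, so a per-line error rate of 10^-6 still leaves roughly 100,000 wrongly flagged lines. A tiny helper to redo the arithmetic with other assumptions:

```python
def expected_false_positives(total_bytes: float, avg_line_bytes: float, error_rate: float) -> float:
    # Number of lines times the per-line probability of a wrong flag.
    line_count = total_bytes / avg_line_bytes
    return line_count * error_rate


# 10 TB of logs, ~100 bytes per line (assumed), accuracy of 1 - 10^-6.
estimate = expected_false_positives(10 * 10**12, 100, 1e-6)
```

This only bounds the problem when every line is classified; narrowing the search to a single job or commit first shrinks the line count, and with it the expected number of false positives.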

by aluzzardi 4 hours ago

Mendral co-founder and post author here.

I agree with your statement and explained in a few other comments how we're doing this.

tldr:

- Something happens that needs investigating

- Main (Opus) agent makes focused plan and spawns sub agents (Haiku)

- They use ClickHouse queries to grab only relevant pieces of logs and return summaries/patterns

This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs and start doing narrow, focused searches.

by jcgrillo 4 hours ago

Yeah this is my experience with logs data. You only actually care about O(10) lines per query, usually related by some correlation ID. Or, instead of searching you're summarizing by counting things. In that case, actually counting is important ;).

In this piece though--and maybe I need to read it again--I was under the impression that the LLM's "interface" to the logs data is queries against ClickHouse. So long as the queries return sensibly limited results, and it doesn't go wild with the queries, that could address both concerns?
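
One way to keep results "sensibly limited" regardless of what the model writes is to enforce the cap outside the model. The sketch below is an assumption, not something described in the post: it uses SQLite as a stand-in database and a hypothetical `run_limited` helper that wraps a single agent-written SELECT in an outer query with a hard row limit.

```python
import sqlite3


def run_limited(conn, query, params=(), max_rows=100):
    """Run one agent-written SELECT and return at most max_rows rows.

    Returns (rows, truncated). The query must be a single SELECT statement.
    """
    inner = query.strip().rstrip(";")
    # Ask for one extra row so the caller can tell the result was cut off.
    cur = conn.execute(f"SELECT * FROM ({inner}) LIMIT ?", (*params, max_rows + 1))
    rows = cur.fetchall()
    return rows[:max_rows], len(rows) > max_rows
```

Returning the `truncated` flag lets the tool tell the agent to narrow its query instead of silently dropping rows. For the counting case mentioned above, the count is done by the database (`SELECT COUNT(*) ...`) and comes back as a single row, so the model never has to count lines itself.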

by NewsaHackO 4 hours ago

What does O(10) mean?

by nahumfarchi 3 hours ago

Mathematically, it means that the number of lines read is bounded by 10*M, where M is some constant. So it's basically equivalent to saying that it's O(1).

I'm guessing the intention was to say "around 10 lines", though it kind of stretches the definition if we're being picky.

by PaulHoule 3 hours ago

See https://en.wikipedia.org/wiki/Big_O_notation

by hansvm 3 hours ago

I normally see that from engineers using "O(x)" as "approximately x" whenever it's clear from context that you're not actually talking about asymptotic complexity.

by jcgrillo 3 hours ago

I've always thought it was like this, maybe I'm wrong:

O(some constant) -- "nearby" that constant (maybe "order of magnitude" or whatever is contextually convenient)

O(some parameter) -- denotes the asymptotic behavior of some parametrized process

O(some variable representing a small number) -- denotes the negligible part of something that you're deciding you don't have to care about--error terms with exponent larger than 2 for example

by wizzwizz4 3 hours ago

Those last two notations are, formally, the same. To call a part negligible, we say it's asymptotically bounded above by a constant multiple of this expression, which obviously goes away as we approach the limit. The first one is a colloquial alternative definition that would probably be considered "wrong" in formal writing.
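
For reference, the formal definition under discussion can be written as follows. This is the standard textbook form; the symbols f, g, M, and a are the conventional ones and are not taken from the thread.

```latex
f(x) = O\bigl(g(x)\bigr) \ \text{as } x \to a
\quad\Longleftrightarrow\quad
\exists\, M > 0 \ \text{such that}\ |f(x)| \le M\,|g(x)| \ \text{for all } x \text{ sufficiently close to } a .
```

With a = ∞ this is the asymptotic-complexity reading; with a = 0 it is the error-term reading, which is why the two notations coincide formally. Under this definition O(10) and O(1) describe the same set of functions, because the constant 10 is absorbed into M.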

by jcgrillo 3 hours ago

Agreed

by unfunco 3 hours ago

I think the O means order of magnitude. It looks like Big O notation, but O(10) would collapse to O(1) and OP is not talking about efficiency anyway.

by PaulHoule 3 hours ago

"about 10"

by gabeh 52 minutes ago

SQL has always been my favorite "loaded gun" api. If you have a control plane of RLS + role based auth and you've got a data dictionary it is trivial to get to a data explorer chat interaction with an LLM doing the heavy lifting.

by Yizahi 6 hours ago

We have an ongoing effort in parsing logs for our autotests to speed up debugging. It is very hard to do, mainly because there is a metric ton of false positives or plain old noise even in the info logs. Tracing the culprit can also be tricky, since an error in container A can be caused by the actual failure in container B, which may in turn depend on something else entirely, including hardware problems.

Basically, a surefire way to train an LLM to parse logs and detect real issues depends almost entirely on the readability and precision of the logging. And if the logging is good enough, then humans can debug faster and more reliably too :) . Unfortunately, the people reading logs and the people coding them rarely overlap in practice, and so the issue remains.

by hinkley 4 hours ago

I think there are too many expectations around what logging is for, and getting everyone on the same page is difficult.

Meanwhile stats have fewer expectations, and moving signal out of the logs into stats is a much much smaller battle to win. It can’t tell you everything, but what it can tell you is easier to make unambiguous.

Over time I got people to stop pulling up Splunk as an automatic reflex and start pulling up Grafana instead for triage.

by shad42 5 hours ago

Yeah it sounds very familiar with what we went through while building this agent. We're focused on CI logs for now because we wanted something that works really well for things like flaky tests, but planning to expand the context to infrastructure logs very soon.

by TKAB 1 hour ago

That post reads as if it were fully LLM-generated. It's basically boasting a list of numbers that are supposed to sound impressive. If there's a coherent story, it's well hidden.

by verdverm 6 hours ago

This is one of those HN posts you share internally in the hopes you can work this into your sprint

by _boffin_ 1 hour ago

Excited to go through this!

by the_arun 3 hours ago

The article doesn't mention which LLM they use or the total cost. If they have used ChatGPT or similar, the token cost itself should be very expensive, right?

by shad42 2 hours ago

There is a cost associated with each investigation (that the Mendral agent is doing). And we spend time tuning the orchestration between agents. Yes expensive but we're making money on top of what it costs us. So far we were able to take the cost down while increasing the relevance of each root cause analysis.

We're writing another post about that specifically; we'll publish it sometime next week.

by p0w3n3d 5 hours ago

That's contrary to my experience. Logs contain a lot of noise and unnecessary information, especially Java logs, so it's best to prepare them before feeding them to an LLM. Not to mention the wasted tokens...

by shad42 5 hours ago

LLMs are better now at pulling the context (as opposed to feeding everything you can inside the prompt). So you can expose enough query primitives to the LLM so it's able to filter out the noise.

I don't think implementing filtering on log ingestion is the right approach, because you don't know what is noise at this stage. We spent more time on thinking about the schema and indexes to make sure complex queries perform at scale.

by sathish316 6 hours ago

SQL is the best exploratory interface for LLMs. But most of the observability data we have today, like metrics, logs, and traces, is hidden behind layers of semantics and custom syntax, which makes it hard for an agent to translate an explore or debug intent into the actual query language.

Large scale data like metrics, logs, traces are optimised for storage and access patterns and OLAP/SQL systems may not be the most optimal way to store or retrieve it. This is one of the reasons I’ve been working on a Text2SQL / Intent2SQL engine for Observability data to let an agent explore schema, semantics, syntax of any metrics, logs data. It is open sourced as Codd Text2SQL engine - https://github.com/sathish316/codd_query_engine/

It is far from done and currently works for Prometheus, Loki, and Splunk in a few scenarios, and is open to OSS contributions. You can find it in action, used by Claude Code to debug using metrics and logs queries:

Metric analyzer and Log analyzer skills for Claude code - https://github.com/sathish316/precogs_sre_oncall_skills/tree...

by mr-karan 4 hours ago

Agreed on SQL being the best exploratory interface for agents. I've been building Logchef[1], an open-source log viewer for ClickHouse, and found the same thing — when you give an LLM the table schema, it writes surprisingly good ClickHouse SQL. I support both a simpler DSL (LogchefQL, compiles to type-aware SQL on the backend) and raw SQL, and honestly raw SQL wins for the agent use case — more flexible, more training data in the corpus.

I took this a few steps further beyond the web UI's AI assistant. There's an MCP server[2] so any AI assistant (Claude Desktop, Cursor, etc.) can discover your log sources, introspect schemas, and query directly. And a Rust CLI[3] with syntax highlighting and `--output jsonl` for piping — which means you can write a skill[4] that teaches the agent to triage incidents by running `logchef query` and `logchef sql` in a structured investigation workflow (count → group → sample → pivot on trace_id).

The interesting bit is this ends up very similar to what OP describes — an agent that iteratively queries logs to narrow down root cause — except it's composable pieces you self-host rather than an integrated product.

[1] https://github.com/mr-karan/logchef

[2] https://github.com/mr-karan/logchef-mcp

[3] https://logchef.app/integration/cli/

[4] https://github.com/mr-karan/logchef/tree/main/.agents/skills...

by testbjjl 5 hours ago

> SQL is the best exploratory interface for LLMs

Any qualifiers here from your experience or documentation?

by shad42 5 hours ago

From my own experience it's true, and I think it's due to the amount of SQL content (docs, best practices, code) that you can find online, which is now in every LLM's training corpus.

Same applies when picking a programming language nowadays.

by pphysch 5 hours ago

"Logs" is doing some heavy lifting here. There's a very non-trivial step in deciding that a particular subset and schema of log messages deserves to be in its own columnar data table. It's a big optimization decision that adds complexity to your logging stack. For a narrow SaaS product that is probably a no-brainer.

I would like to see this approach compared to a more minimal approach with say, VictoriaLogs where the LLM is taught to use LogsQL, but overall it's a more "out of the box" architecture.

by masterj 4 hours ago

> There's a very non-trivial step in deciding that a particular subset and schema of log messages deserves to be in its own columnar data table.

IIUC this is addressed with the ClickHouse JSON type which can promote individual fields in unstructured data into its own column: https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...

Parquet is getting a VARIANT data type which can do the same thing (called "shredding") but in a standards-based way: https://parquet.apache.org/blog/2026/02/27/variant-type-in-a...

by dbreunig 6 hours ago

Check out “Recursive Language Models”, or RLMs.

I believe this method works well because it turns a long context problem (hard for LLMs) into a coding and reasoning problem (much better!). You’re leveraging the last 18 months of coding RL by changing your scaffold.

by koakuma-chan 6 hours ago

This seems really weird to me. Isn't that just using LLMs in a specific way? Why come up with a new name "RLM" instead of saying "LLM"? Nothing changes about the model.

by vimda 5 hours ago

RLMs are a new architecture, but you can mimic an RLM by providing the context through a tool, yes

by anonymousd3vil 3 hours ago

It's a new architecture for building agents, not for the model itself. You still have LLMs, but you give them this new agentic loop with a REPL environment where the LLM can try to solve the problem more programmatically.

by esafak 3 hours ago

Forgive me if this is tangential to the debate, but I am trying to understand Mendral's value proposition. Is it that you save users time in setting up observability for CI? Otherwise could you not simply use gh to fetch the logs, their observability system's API or MCP, and cross check both against the code? Or is there a machine learning system that analyzes these inputs beyond merely retrieving context for the LLM? Good luck!

by shad42 2 hours ago

Mendral is replacing a human Platform Engineer. It debugs the CI logs, looks at the associated commit, looks at the implementation of the tests, etc. It then proposes fixes and takes care of opening a PR.

We wrote about how this works for PostHog: https://www.mendral.com/blog/ci-at-scale

by iririririr 3 hours ago

Am I reading correctly that the compression is just relational records? i.e. omit the PR title, just point to it?

by aluzzardi 3 hours ago

There are 2 layers of compression:

- ZSTD (actual data compression)

- De-duplication (i.e. what you're saying)

Although AFAIK it's not "just point to it" but rather storing sorted data and being able to say "the next 2M rows have the same PR Title"
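
A toy illustration of why sorted data de-duplicates so well: once rows are sorted, repeated values sit next to each other and can be stored as (value, run length) pairs. This is a generic run-length encoding sketch in Python, not a description of how the database actually stores columns internally.

```python
from itertools import groupby


def rle_encode(column):
    # Collapse consecutive equal values into (value, run_length) pairs.
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    # Expand (value, run_length) pairs back into the original column.
    return [value for value, count in runs for _ in range(count)]
```

For a column like the PR title, where millions of consecutive log rows share one value, the encoded form is a handful of pairs rather than millions of strings. The same column in unsorted order would produce many short runs and compress far worse.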

by tehjoker 4 hours ago

Interesting article, but there's no rate of investigation success quoted. The engineering is interesting, but it's hard to know if there was any point without some kind of measure of the usefulness.

by shad42 1 hour ago

We did not want to make the post engineering-focused, but we have 18 companies in production today (we wrote about PostHog in the blog). At some point we should post some case studies. The metric we track for usefulness is our monthly revenue :)

by kingjimmy 4 hours ago

[flagged]

by truth_seeker 4 hours ago

Even if only the top 250 npm packages were refactored by an AI coding agent from a security, performance, and user-friendly API point of view, the whole JS ecosystem would be in a different shape.

The same applies to other language communities, of course.

by whoami4041 6 hours ago

"LLMs are good at SQL" is quite the assertion. My experience with LLM generated SQL in OLTP and OLAP platforms has been a mixed bag. IMO analytics/SQL will always be a space that needs a significant weight of human input and judgement in generating. Probably always will be due to the critical business decisions that can be made from the insights.

by shad42 6 hours ago

What we learned while building this is that every token matters in the context. We spend a lot of time watching logs of agent sessions, changing the tool params, errors returned by tools, agent prompts, etc.

We noticed, for example, the importance of letting the model pull from the context, instead of pushing lots of data into the prompt. We have "complex" error reporting because we have to differentiate between real non-retryable errors and errors that teach the model to retry differently. It changes the model behavior completely.
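
A minimal sketch of what such two-tier error reporting from a query tool could look like. Everything here is an assumption for illustration: SQLite stands in for the database, `ToolResult` and `run_tool_query` are hypothetical names, and the rule for which errors count as "retry differently" is a guess, not Mendral's.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    ok: bool
    rows: list = field(default_factory=list)
    error: str = ""
    retryable: bool = False  # True: the model should rewrite the query and retry
    hint: str = ""           # short guidance returned to the model


# Error fragments that usually mean the query itself was wrong (assumed list).
_FIXABLE = ("no such column", "no such table", "syntax error")


def run_tool_query(conn, query):
    try:
        return ToolResult(ok=True, rows=conn.execute(query).fetchall())
    except sqlite3.OperationalError as exc:
        text = str(exc)
        if any(fragment in text for fragment in _FIXABLE):
            return ToolResult(
                ok=False, error=text, retryable=True,
                hint="The query does not match the schema. Inspect the schema and rewrite it.",
            )
        # Anything else is reported as a hard failure the model should not retry.
        return ToolResult(ok=False, error=text, retryable=False)
```

The point of the split is that the text returned to the model differs: a retryable error carries a hint about what to change, while a non-retryable one tells the agent to stop and report.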

Also I agree with "significant weight of human input and judgement", we spent lots of time optimizing the index and thinking about how to organize data so queries perform at scale. Claude wasn't very helpful there.

by whoami4041 5 hours ago

Very interesting work here, no doubt. It's a measured approach to using an LLM with SQL rather than trying to make it responsible for everything end-to-end.

by SignalStackDev 4 hours ago

[dead]

by blharr 5 hours ago

"LLMs are good at [task I'm not good enough at to tell the LLM is bad at]" is becoming common

by dylan604 6 hours ago

> IMO analytics/SQL will always be a space that needs a significant weight of human input and judgement in generating.

Isn't that precisely what is done when prompting?

by whoami4041 5 hours ago

The key to my point is in the word "generating". Meaning human input/judgement by actually typing more SQL than the LLM produces. The model's reasoning and code generation pipelines are typically 2 separate code paths, so it may not always actually do what it intends which can lead to unexpected results.

by aluzzardi 4 hours ago

> My experience with LLM generated SQL in OLTP and OLAP platforms has been a mixed bag

Models are evolving fast. If your experience is older than a few months, I encourage you to try again.

I mean this with the best intentions: it's seriously mind boggling. We started doing this with Sonnet 4.0 and the relevance was okay at best. Then in September we shifted to Sonnet 4.5 and it's been night and day.

Every single model released since then (Opus 4.5, 4.6) has meaningfully improved the quality of results

by whoami4041 4 hours ago

I totally agree. However, none of them are infallible and never will be. They're nondeterministic by nature. There is an interesting psychological nuance that I've noticed even in myself that comes with AI assistance in coding, and that's the review/approval fatigue. The model could be chugging along happily for hours and make a sudden, terrific error in the 10th hour after you've been staring at reasoning and logs endlessly. The risk of missing the terrific error in that moment is very high at the tail end of the session. The point I was making (poorly) is that in this specific domain, where businesses are making data-driven decisions on output and insights that can determine the trajectory of the entire organization, human involvement is more critical than, say, writing something like a python function with an LLM.

by shad42 3 hours ago

I agree. In the Mendral agent we automated what is time consuming for a human (like debugging a flaky test), but it will need permission to confirm the remediation and open a PR.

But it's night and day to fix your CI when someone (in this case an agent) has already dug into the logs and the code of the test, and proposed options for a fix. We have several customers asking us to automate the rest (all the way to merging code), but we haven't done it for the reasons you mention. Although I am sure we'll get there sometime this year.

by whoami4041 2 hours ago

Shameless plug here for Lexega—a deterministic policy enforcement layer for SQL in CI/CD :) https://lexega.com

There are bridges here that the industry has yet to figure out. There is absolutely a place for LLMs in these workflows, and what you've done here with the Mendral agent is very disciplined, which is, I'd venture to say, uncommon. Leadership wants results, which presses teams to ship things that maybe shouldn't be shipped quite yet. IMO the industry is moving faster than they can keep up with the implications.

by kikki 5 hours ago

Unrelated; what does "mendral" mean? It's a very... unmemorable word

by shad42 5 hours ago

I am sure you've heard it before: there are only two hard things in CS: cache invalidation and naming things.

In the history of this company, I can honestly say that this SQL/LLM thing wasn't the hardest :)

by HanClinto 4 hours ago

And the other of the two problems is off-by-one errors.

by aplomb1026 4 hours ago

[dead]

by octoclaw 4 hours ago

[dead]

by aichen_dev 4 hours ago

[dead]

by yellow_lead 5 hours ago

Why the editorialization of the title? "LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs."

by dang 4 hours ago

I don't think we (mods) did that one, but I do like it, because the original title would provoke many comments reacting only to the "LLMs are good at SQL" claim in the title, reducing discussion of the actual post. The comments do have some of this, but it would be worse if that bit were also in the title.

(In that way you can see the title edit as conforming to the HN guideline "Please use the original title, unless it is misleading or linkbait; don't editorialize." under the "linkbait" umbrella. - https://news.ycombinator.com/newsguidelines.html)

by THESMOKINGUN 5 hours ago

[flagged]

by hal9000xbot 6 hours ago

[flagged]

by emp17344 6 hours ago

I looked through this user's comment history. This is pretty obviously a bot.

by IncreasePosts 6 hours ago

Well it's right in the name. Sometimes you just have to take it at face value

by TheRealPomax 3 hours ago

Title tells us nothing: what's the tl;dr?
