This is the exact point I make whenever people say LLMs aren't deterministic and therefore not useful.
Yes, they are "stochastic". But you can use them to write deterministic tools that produce machine-readable output the LLM can consume. As you mention, you keep building more of these tools and tying them together, and then you have a deterministic "network" of "lego blocks" that you can run repeatably.
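A minimal sketch of what one of those "lego blocks" might look like (the tool and pipeline names here are made up for illustration): a pure function that takes text and emits machine-readable JSON, so identical input always yields identical output, and blocks can be chained into a repeatable network.

```python
import json

def line_count_tool(text: str) -> str:
    """Deterministic 'lego block': same input always yields the same
    machine-readable JSON, which an LLM or another tool can consume."""
    lines = text.splitlines()
    report = {
        "lines": len(lines),
        "blank_lines": sum(1 for l in lines if not l.strip()),
        "longest_line": max((len(l) for l in lines), default=0),
    }
    return json.dumps(report, sort_keys=True)

def pipeline(text: str) -> dict:
    """Chain blocks: every step is deterministic, so re-running the
    whole 'network' on the same input gives identical results."""
    report = json.loads(line_count_tool(text))
    report["passes_style"] = report["longest_line"] <= 80
    return report
```

The LLM only has to call the tool and read the JSON; the stochastic part never touches the check itself.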
OTOH, there's a lot you can do for evaluation before a human ever sees the artifact: does the site load, does it behave the same, did anything major change on the happy path, etc. There's a recent-ish paper where, instead of classic "LLM as a judge", they used LLMs to come up with rubrics, and then had other instances check the original prompt plus rubrics on a binary scale. They saw improvements across a lot of evaluations.
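Roughly, the rubric approach could look like this sketch (the `llm` callable is a hypothetical stand-in for whatever model API you use, not a real library; the paper's exact prompts will differ):

```python
def make_rubric(task_prompt: str, llm) -> list[str]:
    # One LLM instance turns the task into concrete pass/fail criteria.
    raw = llm(f"List binary pass/fail criteria for: {task_prompt}")
    return [line.strip("- ") for line in raw.splitlines() if line.strip()]

def grade(task_prompt: str, artifact: str, rubric: list[str], llm) -> float:
    # Other instances answer each criterion on a binary scale;
    # the score is simply the fraction of criteria that pass.
    passed = 0
    for criterion in rubric:
        answer = llm(
            f"Task: {task_prompt}\nOutput: {artifact}\n"
            f"Criterion: {criterion}\nAnswer strictly YES or NO."
        )
        passed += answer.strip().upper().startswith("YES")
    return passed / len(rubric)
```

Binary questions are much easier for a model to answer consistently than "rate this 1-10", which is presumably where the improvement comes from.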
Then there's "evaluate by having an agent do it" for documentation tracking. Say you have a project, you implement a feature, and you document the changes. You can then have an agent take that documentation and "try it out". That should give you much faster feedback loops.
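The simplest version of "try it out" could be purely mechanical, before any agent judgment is involved: pull the example commands out of the docs and run them. A sketch, assuming docs that prefix runnable commands with `$ ` (that convention, like everything else here, is just an assumption for illustration):

```python
import subprocess

def try_documentation(doc_text: str) -> list[tuple[str, bool]]:
    # Hypothetical sketch: extract the '$ '-prefixed example commands
    # from the feature docs and run them, reporting which still work.
    results = []
    for line in doc_text.splitlines():
        if line.startswith("$ "):
            cmd = line[2:]
            proc = subprocess.run(cmd, shell=True, capture_output=True)
            results.append((cmd, proc.returncode == 0))
    return results
```

An agent can then take the failures as its starting context: either the feature regressed or the docs drifted, and both are worth flagging.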
Another thing that gets quantized is video preferences, to maximize engagement.
Larger composition, though, starts to run into typical software design problems, like dependency graphs, shared state, how to upgrade, etc.
I've been working on this front for over two years now too: https://github.com/smartcomputer-ai/agent-os/
> once they unlock one capability,
What does it mean to unlock? It's an LLM; nothing is locked. The output is only as good as the context, the model, and the environment. Nothing is hidden or locked.