>you take a spec and create tests, every little thing

A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.

Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.

This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge whether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.
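To make the guessing concrete, here is a minimal sketch. The one-line spec and both function names are made up for illustration; the point is only that two reasonable readings of the same sentence give different output, and nothing in the spec says which one was intended.

```python
# Hypothetical spec: "Return the user names sorted alphabetically."
# The spec does not say whether case matters, so both functions below
# are defensible readings of it.

def sort_names_case_sensitive(names):
    # Guess A: plain code-point ordering, so uppercase sorts before lowercase.
    return sorted(names)

def sort_names_case_insensitive(names):
    # Guess B: ignore case when comparing.
    return sorted(names, key=str.lower)

names = ["bob", "Alice", "carol", "Dave"]
guess_a = sort_names_case_sensitive(names)
guess_b = sort_names_case_insensitive(names)

# Both satisfy the spec as written, yet they disagree on this input.
print(guess_a)
print(guess_b)
```

Only the person who wrote the spec can say which of the two was meant, which is the review step described above.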

The jury is still out on how reliable and time-consuming this is vs. writing the code yourself; it is not immediately obvious that it is faster or requires a smaller cognitive load.

reply
Code is not a spec. It's an instruction set. It can be a spec if you try hard, but that's not an inherent property of code. For example, you can write code to be a compiler; that makes it a spec. But hello world is not a spec.

As for whether or not LLMs can write unit tests: the answer is yes.

reply
Hello world is a spec. The spec says to produce the text hello world on standard output.
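
As a rough sketch of that reading, the "spec" can be written down as a check on observable behavior. The names `hello` and `output_of` are illustrative, and ignoring surrounding whitespace is an assumption the one-sentence spec does not settle.

```python
import contextlib
import io

def hello():
    # The "program": writes the text to standard output.
    print("hello world")

def output_of(program):
    # Run a callable and capture what it writes to standard output.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        program()
    return buffer.getvalue()

# The "spec", stated as a check on what the program emits.
assert output_of(hello).strip() == "hello world"
```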
reply
Try running it without a compatible ABI. See how far you get.
reply
Not sure what the point is. We can update the spec with "in the presence of a compatible ABI".
reply
All I'm saying is a program isn't VHS. It's a VHS tape. At that point it's largely philosophy. Can you reconstruct a VHS format from a VHS tape? Sure.
reply
For non-trivial uses it wouldn't be a great spec. But I think we can bring our worlds together with a bit of boilerplate.

> The system shall have behavior identical to that expressed by the system created by the following source code. [add some stuff about environment to taste]
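
One way to read that boilerplate is as a differential test: treat the existing code as the reference and compare a reimplementation against it on many inputs. A minimal sketch, where `reference_abs`, `candidate_abs`, and `first_disagreement` are all hypothetical stand-ins:

```python
import random

def reference_abs(x):
    # Stand-in for "the following source code" that the boilerplate points at.
    return x if x >= 0 else -x

def candidate_abs(x):
    # Hypothetical reimplementation that is supposed to behave identically.
    return max(x, -x)

def first_disagreement(reference, candidate, inputs):
    # Return the first input on which the two disagree, or None if they agree.
    for value in inputs:
        if reference(value) != candidate(value):
            return value
    return None

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.randint(-1000, 1000) for _ in range(200)]
mismatch = first_disagreement(reference_abs, candidate_abs, samples)
print("no mismatch found" if mismatch is None else f"differs at {mismatch}")
```

Sampling like this can only find disagreements, it cannot prove the two are identical, and it says nothing about the environment part of the boilerplate.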

reply
If each step requires iterating in micro-steps with an LLM, plus human review to keep hallucinations from creeping in, at some point you might just be better off letting the human do the work.

Particularly as tokenmaxxing has ended and people are being charged more economic prices. If pricing goes up 5-10x the way Uber, etc. did on the path to profitability, even more so.

reply
IME, regulatory compliance is something you are rarely able to test for in a nice little box or with a well-known suite. So there's no easy "this complies" in many situations, no matter how many lawyers, compliance officers, and LLMs you run it past.
reply
So, what's the difference from human engineering?

Other than that there are "internal micro feedback loops" during development?

reply
I walked down that path for a few months. The more you constrain LLMs, the more underhandedly they behave in order to produce something that satisfies all the constraints.

Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.

reply
This is such an abstract principle that the principle itself cannot be refuted. The plan sounds fine on paper. "Just iterate bro". But it entirely depends on what rational agents you put into the system. Obviously, if I sub in a 5-year-old child everywhere, this loop breaks. Sometimes humans are better at certain things and sometimes AI is; we're still learning.

The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.

I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.

The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less often than with humans, or meets any other sensible criterion? If not, then you can't just give an abstract principle and start making inferences from that.

reply
It's impossible to write a spec in natural language that is unambiguous, complete, and correct. Thus prompts will always generate unreliable software.
reply