undefined

upvote

points

by the_duke6 hours ago |

upvote

by dworks11 minutes ago|

[-]

>I do think there are opportunities in this space, but what I'd like to see is:

>* write text specifications

>* model transforms text into a formal specification

>* then the formal spec is translated into code which can be verified against the spec

This skill does just that: https://github.com/doubleuuser/rlm-workflow

Each stage produces its own output artifact (analysis, implementation plan, implementation summary, etc) and takes the previous phases' outputs as input. The artifact is locked after the stage is done, so there is no drift.

reply

upvote

by onion2k5 hours ago|

[-]

Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)

If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.

reply

upvote

by sensanaty4 hours ago|

[-]

That if at the beginning of your sentence is doing a whole lot of work. Indeed, if we could formally and provably (another extremely loaded word) generate good code that'd be one thing, but proving correctness is one of those basically impossible tasks.

reply

upvote

by xpe2 hours ago|

[-]

> but proving correctness is one of those basically impossible tasks.

To aim for a meeting of the minds... Would you help me out and unpack what you mean so there is less ambiguity? This might be minor terminological confusion. It is possible we have different takes, though -- that's what I'm trying to figure out.

There are at least two senses of 'correctness' that people sometimes mean: (a) correctness relative to a formal spec: this is expensive but doable*; (b) confidence that a spec matches human intent: IMO, usually a messy decision involving governance, organizational priorities, and resource constraints.

Sometimes people refer to software correctness problems in a very general sense, but I find it hard to parse those. I'm familiar with particular theoretical results such as Rice's theorem and the halting problem that pertain to arbitrary programs.

* With tools like {Lean, Dafny, Verus, Coq} and in projects like {CompCert, sel4}.

reply

upvote

by dsr_5 hours ago|

[-]

Let's rephrase:

Since nobody involved actually cares whether the code works or not, it doesn't matter whether it's a different wrong thing each time.

reply

upvote

by brabel2 hours ago|

[-]

You got it completely backwards. The claim is that if the code does exactly what the spec says (which generated tests are supposed to "prove") then the actual code does not matter, even if it's different each time.

reply

upvote

by ModernMech1 hours ago|

[-]

The point they are making is the tests are neither necessary nor sufficient alone to prove the code does exactly what the spec says. Looking at the tests isn't enough to prove anything; as an extreme example, if no one involved looks at the code, then the tests can just be static always passing and you wouldn't know either way whether or not the code matches the spec or not.

If anyone cared enough they could look at the code and see the problem immediately and with little effort, but we're encouraging a world where no one cares enough to put even that baseline effort because *gestures at* the tests are passing. Who cares how wrong the code is and in what ways if all the lights are green?

reply

upvote

by tomtomtom7774 hours ago|

[-]

> If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinity more important than the code itself.

If the spec is so complete that it covers everything, you might as well write the code.

The benefit of writing a spec and having the LLM code it, is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic.

reply

upvote

by pjmlp4 hours ago|

[-]

> If the spec is so complete that it covers everything, you might as well write the code.

Welcome to the usual offshoring experience.

reply

upvote

by SpaceNoodled5 hours ago|

[-]

That's a huge "if."

reply

upvote

by gentooflux5 hours ago|

[-]

I usually invert those to reduce nesting

reply

upvote

by FrankRay784 hours ago|

[-]

Sure, but where are the formal acceptance tests to validate against?

reply

upvote

by 0-_-03 hours ago|

[-]

Besides, you can deterministically generate bad code, and not deterministically generate good code.

reply

upvote

by __loam5 hours ago|

[-]

The code is what the code does.

reply

upvote

by kennywinker4 hours ago|

[-]

The shoe is what the shoe does.

Except one shoe is made by children in a fire-trap sweatshop with no breaks, and the other was made by a well paid adult in good working conditions.

The ends don’t justify the means. The process of making impacts the output in ways that are subtle and important, but even holding the output as a fixed thing - the process of making still matters, at least to the people making it.

reply

upvote

by raw_anon_11114 hours ago|

[-]

The end is whether the code meets the functional and non functional requirements.

And guess how much shoe companies make who manufacture shoes in sweatshop conditions versus the ones who make artisanal handcrafted shoes?

reply

upvote

by kennywinker4 hours ago|

[-]

Ah yes - we should all strive to maximize shareholder value - triangle shirtwaist be damnned.

Btw in my metaphor, we - the programmers - are the kids in the sweatshop.

reply

upvote

by raw_anon_11114 hours ago|

[-]

If you are a “programmer” you are going to be the kids in the sweatshop. On the enterprise dev side where most developers work, it’s been headed in that direction for at least a decade where it was easy enough to become a “good enough” generic full stack/mobile/web etc dev.

Even on the BigTech side being able to reverse a btree on the whiteboard and having on your resume that you were a mid level developer isn’t enough either anymore

If you look at the comp on that side, it’s also stagnated for decade. AI has just accelerated that trend.

While my job has been at various percentages to produce code for 30 years, it’s been well over a decade since I had to sell myself on “I codez real gud”. I sell myself as a “software engineer” who can go from ambiguous business and technical requirements, deal with politics, XYProblems, etc

reply

upvote

by pjmlp4 hours ago|

[-]

What do you think programmers in offshoring consulting shops are? Sadly.

reply

upvote

by raw_anon_11113 hours ago|

[-]

Exactly. I work in a consulting company as a customer facing staff consultant - highest level - specializing in cloud + app dev. We don’t hire anyone less than staff in the US. Anything lower is hired out of the country.

That’s exactly my point. “Programming” was clearly becoming commoditized a decade ago.

reply

upvote

by uoaei3 hours ago|

[-]

Functional requirements are known knowns.

Out of bounds behavior is sometimes a known unknown, but in the era of generated code is exclusively unknown unknowns.

Good luck speccing out all the unanticipated side effects and undefined behaviors. Perhaps you can prompt the agent in a loop a bnumber of times but it's hard to believe that the brute-force throw-more-tokens-at-it approach has the same level of return as a more attentive audit by human eyeballs.

reply

upvote

by raw_anon_11112 hours ago|

[-]

Are you as a developer 100% able to trust that you didn’t miss anything? Your team if you are a team lead who delegates tasks to other developers? If you outsource non business things like Salesforce integrations etc do you know all of the code they wrote? Your library dependencies? Your infrastructure providers?

reply

upvote

by xpe2 hours ago|

[-]

It seems like ^ and ^^ agree to me. Am I missing something?

reply

upvote

by raw_anon_11111 hours ago|

[-]

I don’t know. I’m making a point that the only people whose sole responsibility is code that they personally write are mid level ticket takers.

I don’t review every line of code by everyone whose output I’m responsible for, I ask them to explain how they did things and care about their testing, the functional and non functional requirements and hotspots like concurrency, data access patterns, architectural issues etc.

For instance, I haven’t done web development since 2002 except for a little copy and paste work. I completely vibe coded three internal web admin sites for separate projects and used Amazon Cognito for authentication. I didn’t look at a line of code that AI generated any more than I would have looked at a line of code for a website I delegated to the web developer. I cared about functionality and UX.

reply

upvote

by pjmlp4 hours ago|

[-]

Yet the people voting with their wallets seem to go with cheaper option, regardless of what hides behind it.

Being shoes, offshoring, Webwidgets or AI generated code.

reply

upvote

by kennywinker37 minutes ago|

[-]

Sure. People go for the cheapest option that fits their requirements, mostly.

But we’re the shoemakers, not the consumers. It’s actually our job to preserve our own and our peers quality of life.

Cheapest good option possible doesn’t have to be the sweatshop - tho the shareholders of nike or zara would have you believe that.

reply

upvote

by Copyrightest5 hours ago|

[-]

[dead]

reply

upvote

by jrm45 hours ago|

[-]

I would be very comfortable with - re-run 100 times with different seeds. If the outcome is the same every time, you're reliably good to go.

reply

upvote

by SpaceNoodled1 hours ago|

[-]

Even when it's wrong each time?

reply

upvote

by onion2k48 minutes ago|

[-]

If it's wrong then it's not provably correct (for any value of 'proof').

How you define your proof is up to you. It might be a simple test, or an exhaustive suite of tests, or a formal proof. It doesn't matter. If the output of the code is correct by your definition, then it doesn't matter what the underlying code actually is.

reply

upvote

by abreslav1 hours ago|

[-]

> * model transforms text into a formal specification

formal specification is no different from code: it will have bugs :)

There's no free lunch here: the informal-to-formal transition (be it words-to-code or words-to-formal-spec) comes through the non-deterministic models, period.

If we want to use the immense power of LLMs, we need to figure out a way to make this transition good enough

reply

upvote

by pron4 hours ago|

[-]

If what you're after is determinism, then your solution doesn't offer it. Both the formal specification and the code generated from it would be different each time. Formal specifications are useful when they're succinct, which is possible when they specify at a higher level of abstraction than code, which admits many different implemementations.

reply

upvote

by vidarh3 hours ago|

[-]

The point would presumably be to formalise it, then verify that the formal version matches what you actually meant. At which point you can't/shouldn't regenerate it, but you can request changes (which you'd need to verify and approve).

reply

upvote

by pron3 hours ago|

[-]

But the code produced from the formal spec would still be nondeterministic. And I believe CodeSpeak doesn't wish to regenerate the entire program with each spec change, but apply code changes based on the changes to the spec. Maybe there could be other benefits to formalisation in this case, but determinism isn't one of them.

reply

upvote

by nyrikki1 hours ago|

[-]

Even with classic compilation, it is only the semantic behavior that is preserved.

What the Church–Rosser property/confluence is in term rewriting in lambda calculus is a possible lens.

To have a formally verified spec, one has to use some decidable fragment of FO.

If you try to replace code generation with rewriting things can get complicated fast.[2]

Rust uses affine types as an example and people try to add petri-nets[0] but in general petri-net reachability is Ackerman-complete [1]

It is just the trade off of using a context free like system like an LLM with natural language.

HoTT and how dependent types tend to break isomorphic ≃ equal Is another possible lens.

[0] https://arxiv.org/abs/2212.02754v3

[1] https://arxiv.org/abs/2212.02754v3

[2] https://arxiv.org/abs/2407.20822

reply

upvote

by vidarh3 hours ago|

[-]

It doesn't matter if the code is different if the spec is formal enough to validate the software against it.

I have no idea about codespeak - I was responding to the comments above, not about codespeak.

reply

upvote

by pron2 hours ago|

[-]

Validating programs against a formal spec is very, very hard for foundational computational complexity reasons. There's a reason why the largest programs whose code was fully verified against a formal spec, and at an enormous cost, were ~10KLOC. If you want to do it using proofs, then lines of proof outnumber lines of code 10-1000 to 1, and the work is far harder than for proofs in mathematics (that are typically much shorter). There are less absolute ways of checking spec conformance at some useful level of confidence, and they can be worthwhile, but they require expertise and care (I'm very much in favour of using them, but the thought that AI can "just" prove conformance to a formal spec ignores the computational complexity results in that field).

reply

upvote

by vidarh2 hours ago|

[-]

For most cases we don't need nearly that comprehensive verification. This is expecting more off AI written code than we ever bother to subject most human written code to. There's a vast chasm there we only need to even slightly start to bridge to get to far higher confidence levels than the typical human dev team achieves.

reply

upvote

by davedx6 hours ago|

[-]

My process has organically evolved towards something similar but less strictly defined:

- I bootstrap AGENTS.md with my basic way of working and occasionally one or two project specific pieces

- I then write a DESIGN.md. How detailed or well specified it is varies from project to project: the other day I wrote a very complete DESIGN.md for a time tracking, invoice management and accounting system I wanted for my freelance biz. Because it was quite complete, the agent almost one-shot the whole thing

- I often also write a TECHNICAL-SPEC.md of some kind. Again how detailed varies.

- Finally I link to those two from the AGENTS. I also usually put in AGENTS that the agent should maintain the docs and keep them in sync with newer decisions I make along the way.

This system works well for me, but it's still very ad hoc and definitely doesn't follow any kind of formally defined spec standard. And I don't think it should, really? IMO, technically strict specs should be in your automated tests not your design docs.

reply

upvote

by the_duke5 hours ago|

[-]

I think many have adopted "spec driven development" in the way you describe.

I found it works very well in once-off scenarios, but the specs often drift from the implementation. Even if you let the model update the spec at the end, the next few work items will make parts of it obsolete.

Maybe that's exactly the goal that "codespeak" is trying to solve, but I'm skeptical this will work well without more formal specifications in the mix.

reply

upvote

by dworks7 minutes ago|

[-]

You need to lock the specs and implementation plan and verify the implementation about the previous phase docs.

https://github.com/doubleuuser/rlm-workflow

reply

upvote

by intrasight5 hours ago|

[-]

> specs often drift from the implementation > Maybe that's exactly the goal that "codespeak" is trying to solve

Yes and yes. I think it's an important direction in software engineering. It's something that people were trying to do a couple decades ago but agentic implementation of the spec makes it much more practical.

reply

upvote

by jbonatakis5 hours ago|

[-]

I have been building this in my free time and it might be relevant to you: https://github.com/jbonatakis/blackbird

I have the same basic workflow as you outlined, then I feed the docs into blackbird, which generates a structured plan with task and sub tasks. Then you can have it execute tasks in dependency order, with options to pause for review after each task or an automated review when all child task for a given parents are complete.

It’s definitely still got some rough edges but it has been working pretty well for me.

reply

upvote

by rebolek5 hours ago|

[-]

AGENTS.md is nice but I still need to remind models that it exists and they should read it and not reinvent the wheel every time.

reply

upvote

by allthetime5 hours ago|

[-]

There should be a setting to include specific files in every prompt/context. I’m using zed and when you fire up an agent / chat it explicitly states that the file(s) are included.

reply

upvote

by DrJokepu5 hours ago|

[-]

> Models aren't deterministic

Is that really true? I haven’t tried to do my own inference since the first Llama models came out years ago, but I am pretty sure it was deterministic: if you fixed the seed and the input was the same, the output of the inference was always exactly the same.

reply

upvote

by bigwheels5 hours ago|

[-]

LLMs are not deterministic:

1.) There is typically a temperature setting (even when not exposed, most major providers have stopped exposing it [esp in the TUIs]).

2.) Then, even with the temperature set to 0, it will be almost deterministic but you'll still observe small variations due to the limited precision of float numbers.

Edit: thanks for the corrections

reply

upvote

by dwohnitmok4 hours ago|

[-]

> but you'll still observe small variations due to the limited precision of float numbers

No. Floating number arithmetic is deterministic. You don't get different answers for the same operations on the same machine just because of limited precision. There are reasons why it can be difficult to make sure that floating point operations agree across machines, but that is more of a (very annoying and difficult to make consistent) configuration thing than determinism.

(In general it is mildly frustrating to me to see software developers treat floating point as some sort of magic and ascribe all sorts of non-deterministic qualities to it. Yes floating point configuration for consistent results across machines can be absurdly annoying and nigh-impossible if you use transcendental functions and different binaries. No this does not mean if your program is giving different results for the same input on the same machine that this is a floating point issue).

In theory parallel execution combined with non-associativity can cause LLM inference to be non-deterministic. In practice that is not the case. LLM forward passes rarely use non-deterministic kernels (and these are usually explicitly marked as such e.g. in PyTorch).

You may be thinking of non-determinism caused by batching where different batch sizes can cause variations in output. This is not strictly speaking non-determinism from the perspective of the LLM, but is effectively non-determinism from the perspective of the end user, because generally the end user has no control over how a request is slotted into a batch.

reply

upvote

by comboy4 hours ago|

[-]

Limited precision of float numbers is deterministic. But there's whole parallelism and how things are wired together, your generation may end up on a different hardware etc.

And models I work with (claude,gemini etc) have the temperature parameter when you are using API.

reply

upvote

by the_duke3 hours ago|

[-]

You shouldn't be downvoted - LLMs could in theory be deterministic, but they currently are not, due to how models are implemented.

reply

upvote

by otabdeveloper42 hours ago|

[-]

All my self-hosted inference has temperature zero and no randomness.

It is absolutely workable, current inference engines are just lazy and dumb.

(I use a Zobrist hash to track and prune loops.)

reply

upvote

by wenc2 hours ago|

[-]

Rehashing my comment from before:

I use Kiro IDE (≠ Kiro CLI) primarily as a spec generator. In my experience, it's high-quality for creating and iterating on specs. Tools like Cursor are optimized for human-driven vibing -- they have great autocomplete, etc. Kiro, by contrast, is optimized around spec, which ironically has been the most effective approach I've found for driving agents.

I'd argue that Cursor, Antigravity, and similar tools are optimized for human steering, which explains their popularity, while Kiro is optimized for agent harnesses. That's also why it’s underused: it's quite opinionated, but very effective. Vibe-coding culture isn't sold on spec driven development (they think it's waterfall and summarily dismiss it -- even Yegge has this bias), so people tend to underrate it.

Kiro writes specs using structured formats like EARS and INCOSE (which is the spc format used in places like Boeing for engineering reqs). It performs automated reasoning to check for consistency, then generates a design document and task list from the spec -- similar to what Beads does. I usually spend a significant amount of time pressure-testing the spec before implementing (often hours to days), and it pays off. Writing a good, consistent spec is essentially the computer equivalent of "writing as a tool of thought" in practice.

Once the spec is tight, implementation tends to follow it closely. Kiro also generates property-based tests (PBTs) using Hypothesis in Python, inspired by Haskell's QuickCheck. These tests sweep the input domain and, when combined with traditional scenario-based unit tests, tend to produce code that adheres closely to the spec. I also add a small instruction "do red/green TDD" (I learned this from Simon Willison) and that one line alone improved the quality of all my tests. Kiro can technically implement the task list itself, but this is where agents come in. With the spec in hand, I use multiple headless CLI agents in tmux (e.g., Kiro CLI, Claude Code) for implementation. The results have been very good. With a solid Kiro spec and task list, agents usually implement everything end-to-end without stopping -- I haven’t found a need for Ralph loops. (agents sometimes tend to stop mid way on Claude plans, but I've never had that happen with Kiro, not sure why, maybe it's the checklist, which includes PBT tests as gates).

didn't have the strongest start, but the Kiro IDE is one of the best spec generators I've used, and it integrates extremely well with agent-driven workflows.

reply

upvote

by jnpnj3 hours ago|

[-]

Maybe we're entering the non-deterministic applications too. No more mechanical predictable thing.. more like 90% regular and then weird.

Slightly sarcastic but not sure this couldn't become a thing.

reply

upvote

by rco87864 hours ago|

[-]

How is your 2 step process not susceptible to all the exact same pitfalls you listed above?

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by dist-epoch4 hours ago|

[-]

> Models aren't deterministic - every time you would try to re-apply you'd likely get different output

So like when you give the same spec to 2 different programmers.

reply

upvote

by rco87864 hours ago|

[-]

Yes, if you had each programmer rewrite the code from scratch each time you updated the spec.

reply

upvote

by orbital-decay3 hours ago|

[-]

In reality you give the same programmer an update to the existing spec, and they change the code to implement the difference. Which is exactly what the thing in OP is doing, and exactly what should be done. There's simply no reason to regenerate the result.

The entire thing about determinism is a red herring, because 1) it's not determinism but prompt instability, and 2) prompt instability doesn't matter because of the above. Intelligence (both human and machine) is not a formal domain, your inputs lack formal syntax, and that's fine. For some reason this basic concept creates endless confusion everywhere.

reply

upvote

by dboreham3 hours ago|

[-]

Also like this: https://codeassociates.github.io/conversations-with-claude/c...

reply

upvote

by kennywinker4 hours ago|

[-]

Except each time you compile your spec you’re re-writing it from scratch with a different programmer.

reply

upvote

by pessimizer5 hours ago|

[-]

I think your objections miss the point. My informal specs to a program are user-focused. I want to dictate what benefits the program will give to the person who is using it, which may include requirements for a transport layer, a philosophy of user interaction, or any number of things. When I know what I want out of a program, I go through the agony of translating that into a spec with database schemas, menu options, specific encryption schemes, etc., then finally I turn that into a formal spec within which whether I use an underscore or a dash somewhere becomes a thing that has to be consistent throughout the document.

You're telling me that I should be doing the agonizing parts in order for the LLM to do the routine part (transforming a description of a program into a formal description of a program.) Your list of things that "make no sense" are exactly the things that I want the LLMs to do. I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.

I want to see specs that move away from describing the specific functionality of programs altogether, and more into describing a usefulness or the convenience of a program that doesn't exist. I want to be able to feed the LLM requirements of what I want a program to be able to accomplish, and let the LLM research and implement the how. I only want to have to describe constraints i.e. it must enable me to be able to do A, B, and C, it must prevent X,Y, and Z; I want it to feel free to solve those constraints in the way it sees fit; and when I find myself unsatisfied with the output, I'll deliver it more constraints and ask it to regenerate.

reply

upvote

by ptak_dev1 hours ago|

[-]

This resonates. I built an AI career consultant (CareerCraft AI) with exactly this philosophy — describe what the user needs to accomplish, not how to implement it.

The spec was essentially: "A user shares their professional background and a target job posting. The AI has a conversation to understand their experience, then generates a tailored resume and cover letter as downloadable PDFs. It remembers their profile within the session so they can generate documents for multiple roles."

That's it. No formal grammar, no structured templates for the conversation flow. The LLM handles the "how" — when to ask clarifying questions, how to reframe experience for different roles, what to emphasize based on the job description.

The interesting finding: natural language specs work remarkably well for applications where the output is also natural language (documents, advice, analysis). The formal spec approach makes more sense when the output needs to be deterministic code.

Try it if you're curious: https://super.myninja.ai/apps/6de082c7-a05f-4fc5-a7d3-ab56cc...

reply

upvote

by darkwater5 hours ago|

[-]

> I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.

Be careful what you wish for. This sounds great in theory but in practice it will probably mean a migration path for the users (UX changes, small details changed, cost dynamics and a large etc.)

reply

upvote

by jbritton1 hours ago|

[-]

I tried this recently with what I thought was a simple layout, but probably uncommon for CSS. It took an extremely long back and forth to nail it down. It seemingly had no understanding how to achieve what I wanted. A couple sentences would have been clear to a person. Sometimes LLMs are fantastic and sometimes they are brain dead.

reply

upvote

by fnord775 hours ago|

[-]

[delete]

reply

upvote

by koolala4 hours ago|

[-]

It isn't a formal language, look at the goose example:

https://codespeak.dev/blog/greenfield-project-tutorial-20260...

It is a formal "way" aka like using json or xml like tons of people are already doing.

reply

upvote

by dist-epoch4 hours ago|

[-]

Software products specifications are written in real language, not in first order logic.

reply

upvote

by hkonte5 hours ago|

[-]

[dead]

reply