undefined

points

[-]

> if you can use AI to achieve full test coverage, are there actual benefits in static typing for coding agents?

Full test coverage doesn’t tell you if the tests behave correctly. So you could prompt an AI agent to write 100% test coverage where those tests could be exercising all code paths yet contribute 0% to the story of what the code does. You need human understanding of what the desired contract is that the tests check.

Imagine a contract lawyer who blindly signs any contract that they are given: they aren’t doing their job. They ought to have an idea in mind of what their client’s goals and limits are so they can determine if a given contract fulfils those needs.

Types are a declarative contract, so they can be a lighter yet more limited way to enforce a contract. The compiler can verify if all the declared types across the program agree with each other. This is especially helpful with refactoring, such as ensuring the adding a field has been rolled out everywhere.

Types aren’t to be just checked by the compiler, but checked by the human authors too. That’s why explicit type signatures are valuable, especially if they are kept intelligible. They encode the different variations in state and possible branching on that state. So you can whittle your types down as a way of whittling the solution down to be more focused. The problem in your head is reflected in the types, and any simplifications in the types then simplify the problem in your head, and any tests derived from that understanding.

by osener10 hours ago|

prev|

[-]

In my experience restricting programs that can be expressed is a good thing, even more so with agentic engineering. The more guardrails there are, strong typing/TDD/computer use/..., the solution space shrinks and chance of a robust solution increases. Sure maybe this burns more tokens going in circles but it feels less like a slot machine more like a robot searching for a solution for a well-defined problem.

Devs have very strong opinions about dynamically typed programming languages. But reasons such as "exploratory programming", "expressiveness", "taste" that makes them feel good to program in for humans does not matter for agents. Agents don't care that the language "limits them" and prevents them from expressing the code in a succint way because it would not type check.

by josevalim10 hours ago|

parent|

[-]

Agreed on the guardrails bit. My point is that we still don't have much evidence that static types are an effective way to constrain the search space for coding agents, or how much value they add on top of other mechanisms. Redundancy can certainly be beneficial, but how much and at what cost?

On expressiveness, people often frame it as a dynamic-language goal, but a large portion of type system research is precisely about making type systems more expressive so they can describe a wider range of programs and invariants. This is clearly something both camps value. I suppose another interesting benchmark could be: how do coding agents perform across languages with different degrees of type-system expressiveness?

We may directionally agree, but it is hard to draw conclusions without measurements. Overall, I'd say this is much more of an open question than people give it credit for.

by phillmv4 hours ago|

prev|

[-]

>Type systems restrict which programs can be expressed and increasing expressiveness often requires increasing type-system complexity (which, speaking from experience, both humans and agents will struggle with). Plus they are not the only mechanism to assert correctness (they only validate a subset of your program correctness and do not replace tests)

This articulates a lot of my own thinking wrt type systems, speaking as a downstream user without a lot of exposure to prog language theory, and I wish this debate were more often framed in these terms.

Another reply to this comment hinted that it might be more about giving LLMs feedback loops and that to me also seems like a more likely mechanism.

I'm not an elixir user but I've watched it from a distance over the years – thank you for your efforts and your experimentation.

by teleforce4 hours ago|

prev|

[-]

I used to hold similar opinion but D language, and this article by Patrick Li (HN JITX co-founder) who's the original author of little known but very powerful language Stanza changed my mind [1],[2].

He argued that Ruby has enabled a very expressive language that enabled RoR, and when it was originally written other languages are less capable, and accordingly the proof is in the pudding.

In his new language Stanza for his PhD thesis he has designed an optional typed system supporting both typed and untyped, it seems very similar in concept to the OP article that you've written on Elixir. Groovy also deserved a special mention, and the pudding is Grails.

Interestingly both Elixir and Stanza have GC, but Stanza also support non-GC namely LoStanza in which Stanza GC is written.

Interestingly, D language pioneered this combination both GC (by default) and non-GC more seamlessly, even before Stanza.

In addition to Ruby, these four languages namely Elixir, Groovy, Stanza and D all have similar to or better expressive power than Ruby. Notably both Stanza and D are compiled languages. Above all D is an anomaly in a good way since it's a fully type programming language. Kudos to Walter and the team for giving birth to a highly expressive fully typed modern language, very fast in compilation and runtime, truly one of a kind [3].

Regarding the issue of comparatively smaller corpus for these languages as mentioned by others, I think the new self-distillation technique for LLM and code generation as proposed by Apple, MIT-ETH and UCLA can overcome this limitation [4].

[1] “Stop Designing Languages. Write Libraries Instead” (2016) (278 comments):

https://news.ycombinator.com/item?id=46525640

[2] Stanza: People:

https://lbstanza.org/people.html

[3] Origins of the D programming language:

https://dl.acm.org/doi/10.1145/3386323

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

by agentgt3 hours ago|

parent|

[-]

> Groovy also deserved a special mention, and the pudding is Grails.

I vaguely remember that when Groovy became more typed (statically typed that is. I believe you could always put the types in but they were not checked.) there was a theory that it kind of hurt possible uptake of the language.

The reason being is that people felt well if we are adding types and a project is requiring it why don't we just use: Java, Scala, Kotlin etc. Like did Java getting more features or Kotlin coming really hurt Groovy or just that it became more of a typed language.

An analog (typed language stealing users) could happen to Elixer but I'm not really sure which language it would be.

> I think the new self-distillation technique for LLM and code generation as proposed by Apple

Speaking of Apple and eventual typing Dylan was an amazing language that just never got traction. Open Dylan still exists but few know about it. Its eventual typing is unique because Dylan does CLOS-like multimethod dispatch instead of pattern matching.

by kamma44342 hours ago|

parent|

prev|

[-]

> Groovy also deserved a special mention, and the pudding is Grails.

Not sure it is much of a success. Groovy gets unreadable very fast, and the editor won’t help you. Gradle moved to Kotlin, and it’s 10x better in readability and maintainability.

by giraffe_lady5 hours ago|

prev|

[-]

> 2. Do we actually have any data or evaluations that show which typing discipline is better for agents? The only benchmark I am aware of [AutoCodeBenchmark] has Elixir come first (dynamic) and C# as second (static), so it doesn't answer the question. There are other benchmarks that show dynamic languages require fewer tokens to solve problems (but that's not a metric I particularly care about)

I am actually writing a paper on this right now so nothing I can point you to yet but yes. LLMs are better (produce working code in fewer attempts controlling for the relative size of training corpus) when using type systems with inference and global unification. It is largely about the quality of the error feedback channel so languages with very good compiler errors (accurate, localized, include the correction with the failure) can close a lot of ground.

But inference + sound type system gives you a constraint propagation that genuinely restricts the ability of the LLM to get into trouble. Type systems that require annotation give up most of the benefit, since the annotations are themselves surface area for LLM mistakes. Unification also puts heavy limits on the expressiveness of the language which is a confounder and may actually be a big part of the benefit too.

Everyone has been on the "the training data is better" thing but I actually don't think so. All of the languages that people report as being better because of good training data actually have fairly restrictive type systems. Elixir is an exception, but it has exceptionally good error messages! And also, along with erlang, pretty unique runtime semantics that may contribute but that's outside my domain I'm on type systems. Debunking the training quality thing is not what I'm working on but I have deep suspicions about that common wisdom.

by josevalim4 hours ago|

parent|

[-]

That’s very exciting! Is there anywhere I could follow you for updates? If you don’t want to share it publicly, and is ok with sharing it privately, my email is my username on gmail. Thank you!!

by giraffe_lady4 hours ago|

parent|

[-]

We're hoping to have a preprint ready at the end of the month, I'll send it to you then!

by pramodbiligiri6 minutes ago|

parent|

[-]

I'd like to see this paper too! Email in my profile.

Recently I came across this paper that explores the "Design by Contract" technique for LLMs: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=112...

by josevalim3 hours ago|

parent|

prev|

[-]

Lovely!!!

by rspeele2 hours ago|

prev|

[-]

One thing something like AutoCodeBenchmark cannot demonstrate is what happens when you have human-written type definitions defining the domain before the LLM writes a line of code.

That is something I have found very effective in F#, that I model the domain with types, I know what the type signatures of the functions I need are, and the LLM does the work of actually implementing those functions.

Here is a concrete example:

I have been playing around with a program to assist me with projects I make at home on my hobby-grade CNC router, which does not have an automatic toolchanger. I use a mix of Vectric VCarve and some older handwritten programs to generate GCode files. I end up with a USB drive with maybe 6 to 12 GCode files on it and a model in my head of "to make this product, I start with a board here, gotta install this square nose end mill and zero on this corner of the board, run files A and B. Then install a ball nose end mill and run file C. Then flip the board over lengthwise, switch to a smaller square nose end mill, zero here, run file D. etc. etc."

Although I try to name the GCode files in a self documenting way like 01_TopSide_25square.ngc, if I come back in 1 year and want to make the same thing again, I pretty much always have to open VCarve and eyeball what the hell all the files did and confirm where to zero, what size board to use, etc. So I'm making a tool where I can define those human-operator steps that go with the G-Code files, save it as a "project file", preview in 3d what each step will look like, and export to a printable PDF with screenshots and step-by-step instructions. Hopefully this will reduce the amount of rot that these projects suffer and the cognitive overhead of picking up an old one.

Modeling the steps as F# types was the very first step, like (small excerpt):

    type WorkpiecePlacement =
        {   Id : WorkpieceId
            /// Corner of the workpiece we'll attach to the machine.
            WorkpieceCorner : WorkpieceSpace.Corner3D
            /// Point in machine-space we'll anchor this corner to.
            MachinePoint : MachineSpace.Point
            /// Which face of the workpiece is on top.
            FaceUp : WorkpieceSpace.Face
            /// Rotation around the up-axis.
            Yaw : WorkpieceSpace.Yaw
        }

    type OperationType =
        | PlaceWorkpiece of placement : Operation.WorkpiecePlacement
        | InstallTool of id : ToolId * slot : int option
        | ZeroAt of point : MachineSpace.Point
        | RunGCode of source : GCode.Source
        | RemoveWorkpiece of id : WorkpieceId

For the GCode simulator I needed a parser for GCode files, which produces a type with 1:1 equivalence to the GCode instruction set:

    type GCodeInstruction =
        // --- Motion ---
        | G0_RapidMove of axisMoves : (Axis * float<gcodeunit>) array
        | G1_Move of feedRate : float<gcodeunit/minute> option * axisMoves : (Axis * float<gcodeunit>) array
        | G2_ClockwiseArc of ArcParams
        | G3_CounterClockwiseArc of ArcParams
        | G4_Dwell of seconds : double

        // --- Plane selection ---
        | G17_SelectXYPlane
        | G18_SelectXZPlane
        | G19_SelectYZPlane

        // --- Unit selection ---
        | G20_Inches
        | G21_Millimeters

        // --- Distance mode ---
        | G90_AbsoluteDistance
        | G91_RelativeDistance
        // ... etc truncated, more instructions in real code

But my tool supports doing transforms on toolpaths, like rotating 90 degrees or offsetting so I can easily define that I want to make tiling copies of the same project. To implement those transforms straight up as GCodeInstruction[] -> GCodeInstruction[] is a bad call. GCode is very stateful and lets you switch units, relative vs. absolute coordinate spaces, etc. in instructions. That makes the transform awkward and tricky to write.

So I have a ToolPath type that makes the transforms clean. It normalizes the many ways of expressing the same toolpath in GCode to a single representation with all absolute coordinates in metric units.

    type ToolPathInstruction =
        | Rapid of From : Point * To : Point
        | Linear of From : Point * To : Point * Feed : FeedRate
        | Arc of
            From : Point *
            To : Point *
            Center : Point *
            Plane : Plane *
            Direction : ArcDirection *
            Feed : FeedRate
        | ... etc truncated

That is the appropriate level for the transforms like offset, rotate, scale, etc. to operate on.

Yet there is still ANOTHER level of toolpath-related operations that deserves its own type. When I'm doing simulation of material removal to check for crashes, or rendering the toolpath in 3d, I don't want to deal with arcs! The rendering/simulation is inherently an approximation. It will break down each arc into line segments. So sim code and rendering code shouldn't take a toolpath, it should take basically a line segment list, or in other words...

    type ApproxMove =
        {   From : Vector3
            To : Vector3
            FeedRate : double<m/minute>
            IsRapid : bool
        }

    type ToolPathApproximation =
        {   StartPosition : Vector3
            Moves : ApproxMove[]
        }

Having defined all these types it's clear that I need operations like:

    parse: string -> GCode
    serialize : GCode -> string
    normalizeToToolPath : GCode -> ToolPath
    denormalizeToGCode : ToolPath -> GCode
    offset : Vector3 -> ToolPath -> ToolPath
    rotate90 : ToolPath -> ToolPath
    scale : Vector3 -> ToolPath -> ToolPath
    approximate : ToolPath -> ToolPathApproximation
    simulate : ToolPathApproximation -> MachineState -> MachineState
    renderToolPathWireframe : ToolPathApproximation -> VBO
    renderMachineState : MachineState -> VBO

And so on. An LLM is absolutely awesome at one-shotting the implementations.

I would find it quite frustrating trying to model the same domain without any types, either having all methods working on a single toolpathy data structure that's not really the right fit for any of the places it's used, or having them work on multiple data structures without any clear delineation of which layer is expecting which toolpathy-thing that are all subtly but importantly different.