Data here: https://gertlabs.com/rankings?mode=agentic_coding
I'm working with Clojure, which is used mostly by senior engineers, and it still blows my mind how well Claude writes software in it even though it's a fringe language. It's even able to pick up in-house DSLs written with macros.
Recently, I had a more pleasant experience using LLMs with Go. It reminds me a bit of Python 2.x, when the community seemed, in my view, more focused on embracing a stupid simple language, with everyone trying to write roughly similar "Pythonic" code.
If there’s one language that is the prime example of this, it’s C++, and according to this benchmark it ranks incredibly high.
I’m also thoroughly confused why Kimi 2.6 scores 83% while Opus 4.7 scores 67% for C++. GPT-5.5 isn’t even in the top 10.
Gemma 4 31B scores a 100% success rate for Python (!!) while Opus 4.6 scores only 65%.
This benchmark really seems to be all over the place and doesn’t make sense.
Certain popular PHP codebases appear to use a similar methodology.
Not how any of it works.
I also don’t understand how these “games” map to real world complex problems. How are you measuring success? How does “adversarial customer service” map to “this LLM is better at C++ than the other”? How are you sure you’re not just benchmarking language suitability for a problem?
I have so many questions about this …
- You need to run evals at scale to converge on this kind of behavior: these benchmarks run samples across a pool of hundreds of different types of environments
- Some games are too open-ended to support code play. The customer service game is an example of that, where models are called on every tick of the environment to make a decision (that's the 'decision making' part of the evals which is weighted lowest). Very interesting results but not testing coding ability, just general reasoning.
Not sure what issues you have with models writing C++ vs other languages, but I can imagine all sorts of C++ specific bottlenecks not directly related to the model's ability to reason in the language, like the dependencies, verbosity, extra effort to manage memory, etc. I have only done a little C/embedded work since agentic coding happened but I was pleasantly surprised.
It seems to present results as if they’re testing language abilities, but the problems seem to be reasoning problems.
It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).
Thanks for putting this together! It's interesting.
Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.
A relative lack of training data might have a bigger effect though.
a) Typed Racket
b) OCaml
c) Julia
I would love to see those three added to your benchmarks. And Mistral Medium 3.5 added to the LLM list, please.
Mistral Medium 3.5 is on there, but you will have to scroll down pretty far to find it (does not perform well): https://gertlabs.com/rankings?mode=oneshot_coding
Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".
Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.
I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)
PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.
When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.
Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:
a) Mistral beats GPT in Java and C++
b) It's close for Rust
c) GPT-5.5 easily wins for Go, JavaScript, Python and TypeScript
Model choice really does appear to be language dependent (assuming I'm reading the results correctly).
The Qwen3.6 models have memorized some common games. For example, if you ask one to create an index.html with a snake game, it will generate almost the same high quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.
That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two that can do agentic programming at a reasonable enough tokens per second.
- Haskell
- OCaml
- F#
- Scala
- Gleam
- Purescript
- Grain
- Idris
Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript etc).
Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.
Next I started filtering the list based on additional criteria: no JS compilation target, performance, size of package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).
There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.
Julia pretty much remains in there, even though it doesn't really meet the initial strongly typed criterion, because of my familiarity with it, its ecosystem (especially for AI/ML tasks) and its performance.
I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.
Also somehow the 2 language comparison graphs (avg percentile and success rate) rank Python in dramatically different positions, with Python outranking Rust and Java in the success rate. What does the avg percentile mean in this context?
Percentile compares only the submissions that didn't hard-fail. So they are a bit different, and we incorporate them both into the combined score.
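To make the difference concrete, here is a minimal sketch of how the two numbers can diverge. It assumes (this is a guess, not the benchmark's documented method) that a hard failure is recorded as a missing score and that percentile is computed only over the scores that survive. The function names and the data are made up for illustration.

```python
from bisect import bisect_left
from statistics import mean


def success_rate(scores):
    """Fraction of submissions that did not hard-fail (None marks a hard failure)."""
    return sum(s is not None for s in scores) / len(scores)


def avg_percentile(scores, field):
    """Mean percentile of the non-failing scores against a pool of rival scores."""
    ranked = sorted(field)
    ok = [s for s in scores if s is not None]
    if not ok:
        return None  # every submission hard-failed, so there is nothing to rank
    return mean([100 * bisect_left(ranked, s) / len(ranked) for s in ok])


# Hypothetical data: scores of rival submissions on the same tasks.
field = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
specialist = [None, None, None, 95]  # fails often, excels when it works
generalist = [45, 55, 35, 65]        # never fails, lands mid-pack

print(success_rate(specialist), avg_percentile(specialist, field))
print(success_rate(generalist), avg_percentile(generalist, field))
```

Under these assumptions the specialist shows a low success rate with a very high percentile, which is the same shape as a model that only does well on a handful of tasks.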
Oh wow, we got "tribal domination", "market simulator" and "adversarial customer service". I don't know what those are but it sure sounds like big torment nexus milestones
Maybe we could at least play nicer games like hackenbush and act surprised when there's some wicked use-case that's isomorphic.
EDIT: Ok fine. I like "Rubik's Cube Chess" a lot. Never heard of it, is this analyzed formally at all? Hard to search for since there's tons of collisions
When we reason we need to typically propagate the constraints to arrive at a solution to these constraints. I think the best language to reason in could be something like Lean, which allows both constraints and actual code to be expressed at the same time. Although this might not be the case for current LLMs, as I explain above.
But of course, because the deductive reasoning is inductively taught, there might be various shortcuts which compromise the soundness of deductive reasoning. Hence my claim: LLMs are not as good at it as other algorithms, although they have many other strengths that make up for it.
Actually, JS can get a surprising amount of "intellisense" as well. Not sure if that was used here though.
TIL. If I were to start a truly vibe project, Go would have a significant leg up.
https://github.com/Tencent-Hunyuan/AutoCodeBenchmark/blob/ma...
In my opinion, the only thing holding Elixir back as an LLM deliverable is that there's not as much training data for LLMs to work with.
Of course if we had a new AI that could be trained on a minimum of existing training data, Common Lisp would absolutely beat out everything else. Everything you mentioned about Elixir (REPL, runtime, and ability to hot reload / directly test functions) is possible and was invented in Lisp, with an AST instead of a syntactic language as the ultimate build artifact. CL lets you recover from exceptions and rewind the stack before reloading your fixes and continuing. I can't even fathom the workloads an LLM could conceive of working with that.
Q: Say, what does this Python code do?
A: Nobody f&%^ing knows.
I created a big Python codebase using AI, and the LLM constantly guesses arguments or dictionary formats wrong. Unit tests and stuff like pydantic help, but it's better to avoid that whole class of runtime errors altogether.
This is where I’ve found that a compiled, strongly typed language (any one really) works well with an LLM. With the little bits of friction that are part of writing a language like Go, the LLM can produce pretty decent (and readable) code.
2. Golang syntax and style are very verbose yet simple. There aren’t as many options as in Rust, and less mapping from programming language to domain is needed. As a result, a less sophisticated LLM can spit out Golang successfully and efficiently than is needed for Rust.
There are Go examples (and full blown programs) for anything, from servers to Kubernetes and Docker.
the other reason is if you really want async as is in vogue nowadays, function coloring - but this is rapidly becoming irrelevant, see article.
Maybe if you're working alone.
Even running them 5 times it's WAY more fun
Use Mypy in strict mode and run it in the post-turn hook of your LLM harness so the LLM has no choice but to obey it. And don't use overly general dictionary types when the keys are known at development time; use TypedDicts for annotations if you must use dicts at runtime.
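A minimal sketch of the TypedDict suggestion, assuming a made-up `StorageConfig` shape (the class, its keys, and `describe` are hypothetical names for illustration). The point is that once the keys are declared, a type checker can flag a guessed key before the code ever runs.

```python
from typing import TypedDict


class StorageConfig(TypedDict):
    """Hypothetical config whose keys are known at development time."""
    bucket: str
    retries: int


def describe(cfg: StorageConfig) -> str:
    # Both keys are declared on StorageConfig, so a checker accepts these lookups.
    return f"{cfg['bucket']} (retries={cfg['retries']})"


cfg: StorageConfig = {"bucket": "logs", "retries": 3}
print(describe(cfg))

# A guessed key such as {"bucket": "logs", "retry": 3} still runs as a plain
# dict at runtime, but mypy reports it as an error against StorageConfig.
```

Running `mypy --strict` on the file is what turns the annotation into a hard gate; at runtime a TypedDict is still an ordinary dict.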
rust is a better language in every way for LLMs: more precise typing, better compiler errors, fewer performance footguns, no race conditions, clear interface definitions and implementations
golang is easier for humans to quickly get productive, but the language is lacking in helpful features for an LLM
Typed, garbage collected, fast to compile and run, stdlib that includes just enough to work out of the box. I really don't like writing it by hand but for the LLM it's perfect.
Well, Java and Python do.
Java, C#, Python, Node.
It's simple (do you really ask why that's a selling point?)
It's fast to compile.
It's fast to run.
It's good with parallelism.
It has myriads of examples, and LLMs can pick it up well too.
It has good backing.
It has good tooling.
It's fun.
It statically compiles to a trivially deployable binary.
It's excellent at cross compiling.
It has good adoption.
2. It produces a dependency-less statically linked binary
3. Duck typed interfaces give you static typing with minimal ceremony. They are implemented even for types outside your own code base, which is a common pain point in Java or C#.
4. It compiles quickly
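For readers who haven't used Go, the interface point (item 3) can be sketched in Python with `typing.Protocol`, which is the closest analogue I know of; it is not Go's mechanism, just the same structural idea. `Writer`, `Counter`, and `emit` are hypothetical names. Note that `io.StringIO` comes from the standard library and never declares that it implements `Writer`.

```python
import io
from typing import Protocol


class Writer(Protocol):
    """Anything with a compatible write() method satisfies this interface."""

    def write(self, s: str, /) -> int: ...


class Counter:
    """Our own type: counts characters instead of storing them."""

    def __init__(self) -> None:
        self.total = 0

    def write(self, s: str, /) -> int:
        self.total += len(s)
        return len(s)


def emit(w: Writer, msg: str) -> int:
    # Accepts any object whose shape matches Writer; no inheritance needed.
    return w.write(msg)


buf = io.StringIO()   # a type from outside our code base
counter = Counter()   # a type from inside it
emit(buf, "hello")
emit(counter, "hello")
print(buf.getvalue(), counter.total)
```

In Go the compiler enforces this on every build; in Python it only holds if a type checker is actually run over the code.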
Go’s benefits are primarily around simplicity, readability, and concurrency.
Not that much. Looking at Rust or Haskell complexity, I don't really desire it.
Of course, your response admits, "second to Rust", which I am guessing is an unspoken question in the grandparent's mind.
Say I am building some app that I know will be CPU-bound, why choose Go over say... Swift?
Or when performance is the main but not the only difference, and there are many other benefits.
>Say I am building some app that I know will be CPU-bound, why choose Go over say... Swift?
Because unless you're building for macOS/iOS, Swift is really a no-go, with lackluster support for other platforms. Plus slow to build and convoluted.
Language religious wars are silly: you should choose a language based on your constraints and personal tastes. If there's no clear advantage of one language over another for a given task - then all the options are viable, pick one and get on with solving the problem.
That might be its core feature if you do agentic coding.
Garbage collection is not an issue for 99% of programs. And for those that it is, there are ways to mitigate the issue (e.g. there are extremely high performance trading systems written in Java, where every last sub-millisecond counts).
Blanket fear of GC reminds me of when new programmers learned that assembly is lower level and can be faster, and wondered why everything is not written in assembly.
Or any of the faster typed languages you are most comfortable with, as you might need to look at the code sometimes. LLMs are great at writing and understanding C# and Java.
The great thing about LLM-assisted coding is that an experienced software engineer can acquire decent familiarity with a language quite quickly. And then has a useful sparring partner for understanding and using the quirks and features of a new language.
If I compare the results to another team that uses Python with Claude I see slightly better results on the Java side. Not because Claude knows that better, but because the tools are more rigid by default which creates more of a self correcting loop for Claude. The Python side has Pydantic, but it's a bit of an afterthought, while in Java you can't skip the type checking.
In the end you can do the same things on both sides, it's 95% a team/engineering culture difference. So pick the language that the team knows best.
Absolutely correct. Anthropic showed that 250 examples can "poison" an LLM -- independent of LLM activation count.
I have to steer models hard for C++. They constantly suggest std::variant :P
Godbolt got a 2x speed improvement switching from what he thought was a good fast impl to std::variant
Dimensionality gets bizarre in 1000-D space. Similarity and orthogonality express themselves in strange ways and each dimension codes different semantic meaning.
Therefore, if the training data is highly consistent you are by definition reducing some complexity and/or encoding better similarity.
In Go the statement
result, err := Storage.write(...)
is almost always going to be followed by if err != nil { ... }
In a highly dynamic language you may not get try { Storage.write() } catch (error) { ... }
unless explicitly asked for.
https://github.com/Tencent-Hunyuan/AutoCodeBenchmark/blob/ma...
Being dynamic is secondary. A language that uses exceptions for errors does not always need to surround every try with a catch if the code doesn't need to. You have a top level handler that would catch everything.
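A minimal sketch of that top-level handler pattern, using Python as the exception-based language. `StorageError`, `write`, `save_report`, and `main` are hypothetical names, not anything from the thread.

```python
import sys


class StorageError(Exception):
    """Hypothetical error raised by the storage layer."""


def write(data: bytes) -> int:
    if not data:
        raise StorageError("nothing to write")
    return len(data)


def save_report(data: bytes) -> int:
    # No try/except here: this layer cannot do anything useful with the
    # error, so it lets the exception propagate to the caller.
    return write(data)


def main(payloads: list[bytes]) -> int:
    # The single top-level handler that catches whatever the layers
    # below chose not to handle.
    try:
        for payload in payloads:
            save_report(payload)
    except StorageError as exc:
        print(f"fatal: {exc}", file=sys.stderr)
        return 1
    return 0


print(main([b"abc"]), main([b"abc", b""]))
```

The intermediate function stays free of error plumbing, which is the trade-off against Go's explicit `if err != nil` at every call site.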
...for which ample training data is available.
> This makes sense, given that they are derived from text translation systems.
...for languages with ample training data available.
Yes, LLMs can combine information in novel ways. They are wonderful in many respects. But they make far more mistakes if they can't lean on copious amounts of training data. Invent a toy language, write a spec, and ask them to use it. They will, but they will have a hard time.
The only code that exists on the internet for this is test data and a few docs in the github repo. It’s not wildly different from most scripting languages, from a syntax point of view, but it is definitely niche.
Both Codex and Claude figured it out real fast from an example script I was debugging. I was amazed at how well they picked up the minor differences between my script and others. This is basically on next to zero training data.
Would I ask it to produce anything super complex? Definitely not. But I’ve been impressed with how well it handles novel languages for small tasks.
Sure. But given the relation with translation systems, it seems far more likely that there are diminishing returns to larger volumes of training data.
An experienced Rust developer is going to be in a better position to drive an agent to generate useful Rust code than a Python programmer with little or no Rust experience. Not sure I agree with the author that everyone should just generate reams of Rust now.
At least if you get paged at 3am to fix the 300k AI-generated Django blog you’ll have a chance at figuring things out. Good luck to you if Claude is down at the same time. But still better than if it was in Rust if you have no experience with that language.
I don't think the training set matters that much, since there's no way they have my language in their training set!
Programming languages have a lot in common. Python is kind of odd when it comes to languages.
I won't be surprised if one day they do.
At least in their current form, I don't think they can independently design a language that is so much better than other available ones that it makes sense for them to use it.
There's a very good language for almost every use case already, designing one better than the ones already available is a VERY tall order.
It's almost like these languages aren't designed by morons, but built by teams of geniuses over a decade instead.
It's taken me 6 months of heavily steering an LLM to build a language that is not yet even ready for production use.
Maybe I'm the one slowing the LLM down. But it certainly does not seem that way.
The key to a good language for them - from my experience - is maximum expression plus minimum global complexity.
Anything that makes you manage memory lifetimes & memory safety is inherently unfriendly to LLMs - that's globally complex.
All scripting languages allow spaghetti aliases that let you hack your way into oblivion - and LLMs gladly ride that gravy train to hell.
Rust excels here, because it prevents the worst and is WAY more expressive than most people think.
Go has arguably the best runtime ever built, but it's not very expressive, and it still has a lot of problems from C and scripting languages - I don't think these types of languages will be the ones people choose to write code with for LLMs in the future.
Go for example has significantly less training data than Python, but LLMs are the best at it. Why? Go is often written the same. You go from project to project and the code looks all the same. There are only a very few ways to write Go.
I especially found that there is no difference between languages based on that. The architecture of all generated code is terrible if you don't actively, manually maintain it all the time, or if you don't already have a few tens of thousands of lines of finely architected code in your codebase from which they can understand how it should really be done. And the reason, I think, is quite simple: the average code on the internet - regardless of market penetration of the given language - is simply bad.
To the extent today's AI can reason, add this to the pile of evidence that you definitely need a harness. Counter to what you hear, that seems true for SOTA and frontier models, not just toy models. Lots of people were saying many years ago someone should test exactly this, because it's obvious. Someone at megacorp probably did try and decided not to publish because they thought it was bad optics.
I find that Claude can write great modern Python more or less out of the box, with minimal style guidance from me. I do have to nudge it from time to time to not do silly things, but overall it's really rather good.
So languages with dynamic typing might hide some errors until runtime, while statically typed ones can catch them during compilation.
With dynamic ones you need way more tests to cover some of the scenarios that the compiler covers for the others.
And there is a significant amount of code written "for ages" in languages that have been around longer, like C, C++, Java (yes, I know that Python is quite old, older than Java - 1991).
edit: side -> site
So as the article points out, an iterative process that catches the mistakes at compile time is much more suited for an AI than one that catches them at runtime.
I still read the generated code, so I'm not quite willing to give up on Python yet though.
My programs are faster and more reliable than they’ve ever been.
That's actually part of the point. Almost no one writes types for Python and has complete type compliance. So all that training data is people just yoloing Python, writing a bunch of poor code in it.
I honestly can't believe any experienced software engineer would decide to build systems in Python these days.
Well, go on and do the experiment! Perhaps LLMs can write code as well in BF as Python but I don't recommend it because hallucinations are really hard to notice in BF.
If you are going to worry about high level computer languages and AI, you are going to have to start with getting to grips with machine code and assemblers and that. Once you know how say some Python code ends up being processed by your laptop CPU(s), then you will know when BF might be best!