I've noticed that with Clojure(Script), unless you specifically instruct models to keep nesting levels low, they can hit a point where they make a paren placement error and can't debug their way out of it. In my case, though, one model made the error and couldn't find what it had done, while a second model that I switched to was able to identify it and back it out. So I suspect this is a transient weakness in today's models, not something fundamental.
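
For anyone who wants to catch this kind of error outside the model, here is a rough sketch of a delimiter checker for Lisp-like source. It is only an illustration: `check_delimiters` is a hypothetical helper, not part of any Clojure tooling, and it assumes a simplified syntax (it skips double-quoted strings and `;` comments, but ignores character literals and reader macros).

```python
# Hypothetical helper: a simplified delimiter check for Lisp-like source.
PAIRS = {")": "(", "]": "[", "}": "{"}


def check_delimiters(source):
    """Return (max_depth, problems) for a source string.

    Assumption: only double-quoted strings and ';' line comments are skipped;
    character literals such as \\( are NOT handled.
    """
    stack = []        # (opening delimiter, line number where it was opened)
    problems = []
    max_depth = 0
    in_string = False
    in_comment = False
    escaped = False
    line = 1
    for ch in source:
        if ch == "\n":
            line += 1
            in_comment = False  # a line comment ends at the newline
        if in_comment:
            continue
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == ";":
            in_comment = True
        elif ch == '"':
            in_string = True
        elif ch in "([{":
            stack.append((ch, line))
            max_depth = max(max_depth, len(stack))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append(f"line {line}: unexpected '{ch}'")
    for ch, opened_at in stack:
        problems.append(f"line {opened_at}: '{ch}' never closed")
    return max_depth, problems


depth, problems = check_delimiters("(defn add [a b]\n  (+ a b)")
print(depth, problems)
```

The maximum depth it reports could also be used to flag functions that nest more deeply than you'd like before handing them back to a model.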

by gertlabs 7 hours ago

That's a good idea. Would you rather see Lisp or Scala? Any interest in Prolog? We are trying to be selective to keep the data concentrated, but we will eventually add a couple more, most likely to sample different programming paradigms.

by isityettime 1 hour ago

I think Clojure would probably make for a more interesting comparison because its syntax is more different from the other languages currently on there and it's less multi-paradigm than Scala is (it doesn't support OOP, it's more explicitly immutable-first). I think Scala is a lovely and cool language, but I'd be more interested in the Clojure comparison here.

Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.

by 1659447091 4 hours ago

If you are taking requests, I was hoping to see Clojure on there.

by andai 3 hours ago

My spider sense tells me the immutable-ness would help with correctness, but I'm not sure how much difference it would make in practice. Would love to see some numbers.

A relative lack of training data might have a bigger effect though.

by phillc73 6 hours ago

Just last night I was going down the rabbit hole of "what's the best programming language to use for vibe coding." I came to a short list of:

a) Typed Racket

b) OCaml

c) Julia

I would love to see those three added to your benchmarks. And Mistral Medium 3.5 added to the LLM list, please.

by gertlabs 5 hours ago

Thanks for the recs, we will look into adding some of these, maybe OCaml for variety. I'm not familiar with Racket.

Mistral Medium 3.5 is on there, but you will have to scroll down pretty far to find it (does not perform well): https://gertlabs.com/rankings?mode=oneshot_coding

by isityettime 58 minutes ago

Racket is a variety of Scheme that grew up as a teaching language, but now has a few other notable niches as well.

Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".

Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.

I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)

PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.

When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.

by phillc73 5 hours ago

Thanks for that, I hadn't scrolled down far enough.

Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:

a) Mistral beats GPT in Java and C++

b) It's close for Rust

c) GPT-5.5 easily wins for Go, JavaScript, Python and TypeScript

Model choice really does appear to be language dependent (assuming I'm reading the results correctly).

by gertlabs 4 hours ago

The deeper you go into the filters (single models, cross-correlated by specific languages), the smaller your sample sizes. It's a known limitation. Tbh, I doubt Mistral is better than GPT-5.5 at programming in any specific language; we probably hit a few lower-quality generations by GPT-5.5 by chance. But I could be wrong! We're always adding more samples, so the data improves over time, and we always prioritize the largest sample counts for near-frontier models first.
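
To make the sample-size point concrete, here is a minimal sketch using the Wilson score interval for a win rate. The counts are made up for illustration and are not taken from the rankings; `wilson_interval` is a hypothetical helper, and treating each comparison as an independent win/loss trial is an assumption about how the data could be summarized, not a description of how the site scores models.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts: the same 60% observed win rate at two sample sizes.
small = wilson_interval(6, 10)
large = wilson_interval(60, 100)
print("n=10: ", small)
print("n=100:", large)
```

With these made-up numbers, the interval at n=10 runs from roughly 0.31 to 0.83, while at n=100 it narrows to roughly 0.50 to 0.69, so a 60% "win" in a heavily filtered cell is still compatible with the model actually being worse.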

by regularfry 2 hours ago

What's going on with Qwen3.6 27b? Filtered to Python it comes out at the top of the list, which seems... well, unlikely.

by johndough37 minutes ago

While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.

The Qwen3.6 models have memorized some common games. For example, if you ask it to create an index.html with a snake game, it will generate almost the same high-quality snake game every time. The relatively low success rate of 25% combined with a high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.
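
A toy illustration of how those two numbers can coexist, assuming (and this is only an assumption about the scoring, not something the site confirms) that the average percentile is computed over successful runs only. The data and the `summarize` helper are hypothetical.

```python
from statistics import mean


def summarize(results):
    """results: list of (succeeded, percentile); percentile is None on failure.

    Assumption: the average percentile only counts successful runs.
    """
    scored = [pct for ok, pct in results if ok]
    success_rate = len(scored) / len(results)
    avg_percentile = mean(scored) if scored else None
    return success_rate, avg_percentile


# Hypothetical "specialist": nails one memorized task, fails the rest.
specialist = [(True, 99.0), (False, None), (False, None), (False, None)] * 2
# Hypothetical "generalist": succeeds more often, with middling scores.
generalist = [(True, 70.0), (True, 65.0), (True, 75.0), (False, None)] * 2

print(summarize(specialist))
print(summarize(generalist))
```

Under that assumption, the specialist tops a list sorted by average percentile despite succeeding on only a quarter of its attempts.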

by 2ndorderthought 1 hour ago

Qwen3.6 27b is a really strong model.

by andai 3 hours ago

Those are some fine languages, but how did you pick them? What was the criterion?

by phillc73 2 hours ago

The initial criteria were strongly typed and functional-first. I used an LLM for answers, of course, and it returned a list that looked like:

- Haskell

- OCaml

- F#

- Scala

- Gleam

- PureScript

- Grain

- Idris

Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript, etc.).

Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.

Next I started filtering the list based on additional criteria: no JS compilation target, performance, size of package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).

There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.

Julia pretty much remains in there, even though it doesn't really meet the initial strongly typed criterion, because of my familiarity with it, its ecosystem (especially for AI/ML tasks), and its performance.

I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.

by librasteve 3 hours ago

I just did a side-by-side with Claude Code Python vs. Raku for DSL use ... https://slangify.org if you are interested.