I would love to see how they do with functional languages and especially Lisps here. I've noticed pretty good performance with Emacs Lisp relative to overall model strength, but I haven't used LLMs to write application code in any such languages.
It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).
Thanks for putting this together! It's interesting.
That's a good idea. Would you rather see Lisp or Scala? Any interest in Prolog? We are trying to be selective to keep the data concentrated, but we will eventually add a couple more, most likely to sample different programming paradigms.
I think Clojure would probably make for a more interesting comparison because its syntax is more different from the other languages currently on there and it's less multi-paradigm than Scala is (it doesn't support OOP, it's more explicitly immutable-first). I think Scala is a lovely and cool language, but I'd be more interested in the Clojure comparison here.
Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.
That is only accurate if OOP means "inheritance-based class hierarchies with mutable state" - which is one narrow definition of it. Clojure has solid OOP support, just not in the class-hierarchy-first sense.
In this context, what I mean is "Clojure seems like a more interesting example because inheritance-based class hierarchies with mutable state are a footgun and it lacks them; I wonder if that will help LLMs be any more effective in the language." :)
But I am curious about your favorite OOP-y tools in Clojure. I know it has flexible dispatch, it has a notion of agents that are a bit like objects in how they encapsulate state... but it's been a long time since I really used Clojure and I don't have a clear picture of what the best OOP-y idioms in Clojure look like or what makes them good to use.
Clojure is opinionated about state and identity, not about paradigm in general. It will accommodate whatever paradigms as long as you respect its model of how state gets managed. But try to smuggle some pervasive mutability and you'll feel resistance. Immutability and explicit state management are non-negotiable defaults, and most paradigms fit comfortably within that constraint. Clojure bends well toward:
- Functional (its primary identity)
- Data-oriented (how you model your domain) / data-driven (how you control behavior)
- There's another paradigm vector (Spec/Malli), but it frankly falls into a gap in the terminology landscape. It's neither Dependent Typing nor Gradual Typing; not quite Contract-based Design; not Refinement Types (typically a static concept); and calling it Schema Validation really undersells it, since that implies just input checking. It does something genuinely novel in combination. There isn't a single established term to capture all of that.
Where the language resists:
- Classical OOP with mutable stateful objects - you can do it via Java interop or careful use of atoms+records, but the language actively nudges you away. It won't feel natural and you'll be fighting the grain.
- Imperative/procedural style - possible, but again, why?
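To make the "immutable values plus explicit state management" point concrete for people who haven't used Clojure, here is a rough sketch in Python rather than Clojure. The `Atom` class, `Account` and `deposit` are hypothetical names made up for illustration, and the lock is a simplification (Clojure's real atoms use compare-and-swap with retries), so treat this as an analogy, not as how Clojure implements it.

```python
import threading
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Account:
    """An immutable value, loosely analogous to a Clojure map or record."""
    owner: str
    balance: int


class Atom(Generic[T]):
    """Hypothetical helper: one mutable reference to an immutable value."""

    def __init__(self, value: T) -> None:
        self._value = value
        self._lock = threading.Lock()  # simplification; not how Clojure does it

    def deref(self) -> T:
        return self._value

    def swap(self, fn: Callable[[T], T]) -> T:
        # State only changes by applying a pure function to the current value.
        with self._lock:
            self._value = fn(self._value)
            return self._value


def deposit(account: Account, amount: int) -> Account:
    # Pure function: builds a new Account and leaves the old one untouched.
    return replace(account, balance=account.balance + amount)


state = Atom(Account("ada", 100))
before = state.deref()
after = state.swap(lambda a: deposit(a, 50))
```

The old value (`before`) is still intact after the swap, and the only place anything changes is the single `swap` call. That is the part the language makes easy; scattering mutable fields across many objects is the part it pushes back on.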
My spider sense tells me the immutable-ness would help with correctness, but I'm not sure how much difference it would make in practice. Would love to see some numbers.
A relative lack of training data might have a bigger effect though.
> A relative lack of training data might have a bigger effect though.
Nope. Not with Lisps. I've been using LLMs with Clojure/ClojureScript, Elisp and Fennel for my personal stuff, and Python, Java, JS/TS and Go for work. LLMs are surprisingly good with Lisps, perhaps precisely because there's less fragmentation. There is so much variability with, say, Python for a given task, because the training set is enormous. But how many ways are there to do the same thing in Clojure? Python/Java/TS's enormous training set is almost a liability for quality - the model has seen every beginner tutorial, every legacy pattern, every conflicting style guide. With Clojure it's more like the model learned from a curated corpus by default. Lisps have nearly zero syntactic noise - the AST is the source. This means an LLM doesn't need to learn parsing heuristics; structure is always explicit. That likely makes correct generation easier even with less data.
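As a rough illustration of "the AST is the source", here is a toy S-expression reader in Python. It is a minimal sketch under obvious assumptions: it only handles parentheses, integers and bare symbols, and ignores strings, comments and the rest of any real Lisp's reader syntax. The function names `tokenize` and `read` are just illustrative.

```python
def tokenize(src: str) -> list[str]:
    # Parentheses are the only delimiters, so padding them is enough to split.
    return src.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens: list[str]):
    """Read one form from the front of the token list, consuming tokens."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens and tokens[0] != ")":
            form.append(read(tokens))
        if not tokens:
            raise SyntaxError("missing closing paren")
        tokens.pop(0)  # drop the matching ")"
        return form
    if tok == ")":
        raise SyntaxError("unexpected closing paren")
    try:
        return int(tok)
    except ValueError:
        return tok  # anything else is treated as a symbol


tree = read(tokenize("(define (square x) (* x x))"))
```

Here `tree` is a nested list that mirrors the source text one-to-one, with no precedence rules or statement/expression distinction to recover. A comparable reader for Python or Java syntax would be far longer.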
Racket is a variety of Scheme that grew up as a teaching language, but now also has a few other notable niches as well.
Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".
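For anyone unfamiliar with the term, here is a small sketch of what gradual typing looks like, using Python's optional annotations as a stand-in since they follow the same idea: typed and untyped code coexist in one program. The function names are made up for illustration. Note that Python itself does not enforce these annotations at runtime; a separate static checker such as mypy is what would flag a bad call.

```python
from typing import Any


def area_untyped(width, height):
    # No annotations: a static checker treats the parameters as dynamic.
    return width * height


def area_typed(width: float, height: float) -> float:
    # Annotated: a static checker can reject a call like area_typed("3", 4).
    return width * height


def mixed(data: Any) -> float:
    # The boundary between the two worlds: values of type Any flow into
    # typed code without the checker complaining.
    return area_typed(data["w"], data["h"])
```

The benchmark question is then whether the annotated style gives a model (or its tooling loop) enough extra signal to do measurably better than the unannotated style in the same language.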
Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.
I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)
PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.
When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.
The deeper you go into the filters (single models, cross-correlated by specific languages), the smaller your sample sizes. It's a known limitation. Tbh I doubt Mistral is better than GPT 5.5 at programming in any specific language; we probably hit a few lower-quality generations by GPT 5.5 by chance. But I could be wrong! We're always adding more samples, so the data improves over time, and we always prioritize the largest sample counts for near-frontier models first.
While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.
The Qwen3.6 models have memorized some common games. For example, if you ask one to create an index.html with a snake game, it will generate almost the same high-quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.
The more filters applied (one-shot coding only, Python only), the more variation you can expect from fewer samples -- that being said, it really is a great model so it's probably not too far above where it would end up with infinite samples.
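To put a rough shape on how much a filtered number can move, here is a small sketch that computes a Wilson score interval for a success rate. The sample counts below are hypothetical, picked only so that the observed rate is 25% each time; they are not the site's actual counts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (for z = 1.96) around an observed success rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


# Same observed 25% success rate at three hypothetical sample sizes.
for successes, n in [(2, 8), (20, 80), (200, 800)]:
    low, high = wilson_interval(successes, n)
    print(f"n={n:4d}: {low:.2f} to {high:.2f}")
```

The interval narrows as `n` grows, which is the whole point: with only a handful of samples behind a filter combination, the same 25% is compatible with a wide range of true rates.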
Yes, that strong. It's only lacking in context length, but it's not that small there, and it gets caught in circles more often than, say, a 1T-parameter model does.
That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two that can do agentic programming at a reasonable enough tokens per second.
> it gets caught in circles more often than, say, a 1T-parameter model does.
I've found that the Q5+ quants are less loopy than Q4. Still not perfect, but noticeably better.
> reasonable enough tokens per second
The speed has been amazing. I've been running the recent llama.cpp MTP branch with an uncensored variant of Qwen3.6-35B-A3B on my RTX 3090 at over 170 tokens per second, and it was able to turn a buffer overflow into a reliable shell exploit in just a few seconds (with reasoning disabled). Still a bit loopy though. Hopefully, the Qwen team will pay more attention to those looping issues. It feels like their models are especially susceptible.
The initial criteria were strongly typed and functional-first. I used an LLM for answers, of course, and it returned a list that looked like:
- Haskell
- OCaml
- F#
- Scala
- Gleam
- Purescript
- Grain
- Idris
Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript etc).
Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.
Next I started filtering the list based on additional criteria: not targeting JS as a compilation target, performance, size of package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).
There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.
Julia pretty much remains in there, even though it doesn't really meet the initial strongly typed criterion, because of my familiarity with it, its ecosystem (especially for AI/ML tasks) and its performance.
I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.
I've noticed that with Clojure(Script), unless you specifically instruct them to keep nesting levels low, they can hit a point where they make a paren placement error and can't debug their way out of it. In my case, though, one model made the error and then couldn't find what it had done, while a second model that I switched to was able to identify it and back it out. So I suspect this is a transient weakness in today's models, not something fundamental.
It's fundamental in that it's harder because there's less information per token. But we know it's not impossible because they can get nesting right at all, it's just a question of where the boundary is today. And if different models have different crapping-out points, then there's a gradient there and future models can do better.
In token terms it's more like the fingers problem than the strawberry problem. ")" is a single token, but the model gets confused by several repeats of the same thing.
It's a bit of a pitiful way to fail. I wonder if diffusion models could handle parenthesis matching better. And I wonder if you could rig up tools for structural editing like with paredit.
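Short of full paredit-style structural editing, even a dumb delimiter checker exposed as a tool might help a model locate the error instead of counting parens by eye. Here is a minimal sketch in Python; the function name and message format are made up, and it deliberately ignores strings, comments and character literals, which a real tool for Clojure would have to handle.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def find_delimiter_errors(src: str) -> list[str]:
    """Report unexpected closers and unclosed openers as 'line:col: message'."""
    errors = []
    stack = []  # entries are (opener, line, col)
    line, col = 1, 0
    for ch in src:
        if ch == "\n":
            line += 1
            col = 0
            continue
        col += 1
        if ch in OPENERS:
            stack.append((ch, line, col))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                errors.append(f"{line}:{col}: unexpected '{ch}'")
    # Anything left on the stack was opened but never closed.
    for opener, open_line, open_col in stack:
        errors.append(f"{open_line}:{open_col}: unclosed '{opener}'")
    return errors


broken = "(defn f [x]\n  (+ x 1)"
print(find_delimiter_errors(broken))
```

Feeding the positions back to the model after each edit turns "something is off with the parens somewhere" into a specific location, which seems like the kind of feedback the second model in the anecdote above managed to work out on its own.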