
I would love to see how they do with functional languages and especially Lisps here. I've noticed pretty good performance with Emacs Lisp relative to overall model strength, but I haven't used LLMs to write application code in any such languages.

It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).

Thanks for putting this together! It's interesting.



That's a good idea. Would you rather see Lisp or Scala? Any interest in Prolog? We are trying to be selective to keep the data concentrated, but we will eventually add a couple more, most likely to sample different programming paradigms.


I think Clojure would probably make for a more interesting comparison because its syntax is more different from the other languages currently on there and it's less multi-paradigm than Scala is (it doesn't support OOP, it's more explicitly immutable-first). I think Scala is a lovely and cool language, but I'd be more interested in the Clojure comparison here.

Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.


> it doesn't support OOP

That is only accurate if OOP means "inheritance-based class hierarchies with mutable state" - which is one narrow definition of it. Clojure has solid OOP support, just not in the class-hierarchy-first sense.


In this context, what I mean is "Clojure seems like a more interesting example because inheritance-based class hierarchies with mutable state are a footgun and it lacks them; I wonder if that will help LLMs be any more effective in the language." :)

But I am curious about your favorite OOP-y tools in Clojure. I know it has flexible dispatch, it has a notion of agents that are a bit like objects in how they encapsulate state... but it's been a long time since I really used Clojure and I don't have a clear picture of what the best OOP-y idioms in Clojure look like or what makes them good to use.

Care to explain a bit more?


Clojure is opinionated about state and identity, not about paradigm in general. It will accommodate whatever paradigms as long as you respect its model of how state gets managed. But try to smuggle some pervasive mutability and you'll feel resistance. Immutability and explicit state management are non-negotiable defaults, and most paradigms fit comfortably within that constraint. Clojure bends well toward:

- Functional (its primary identity)

- Data-oriented (how you model your domain) / data-driven (how you control behavior)

- Polymorphic/interface-driven (protocols, multimethods)

- Logic-style (via core.logic) - mostly unused

- Reactive/event-driven/actor-like concurrency (core.async, manifold)

- There's another paradigm vector (Spec/Malli), but it frankly falls into a gap in the terminology landscape. It's neither Dependent Typing nor Gradual Typing; not quite Contract-based Design; not Refinement Types (typically a static concept); and calling it Schema Validation really undersells it, since that implies just input checking. It does something genuinely novel in combination. There isn't a single established term to capture all of that.

Where the language resists:

- Classical OOP with mutable stateful objects - you can do it via Java interop or careful use of atoms+records, but the language actively nudges you away. It won't feel natural and you'll be fighting the grain.

- Imperative/procedural style - possible, but again, why?


If you are taking requests, I was hoping to see Clojure on there.


My spider sense tells me the immutable-ness would help with correctness, but I'm not sure how much difference it would make in practice. Would love to see some numbers.

A relative lack of training data might have a bigger effect though.


> A relative lack of training data might have a bigger effect though.

Nope. Not with Lisps. I've been using LLMs with Clojure/ClojureScript, Elisp and Fennel for my personal stuff, and Python, Java, JS/TS, and Go for work. LLMs are surprisingly good with Lisps, perhaps precisely because there's less fragmentation. There is so much variability with, say, Python for a given task, because the training set is enormous. But how many ways are there to do the same thing in Clojure? Python/Java/TS's enormous training set is almost a liability for quality - the model has seen every beginner tutorial, every legacy pattern, every conflicting style guide. With Clojure it's more like the model learned from a curated corpus by default. Lisps have nearly zero syntactic noise - the AST is the source. This means an LLM doesn't need to learn parsing heuristics; structure is always explicit. That likely makes correct generation easier even with less data.
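
To make the "structure is always explicit" point concrete, here is a minimal sketch in Python of a reader for a toy s-expression subset. It assumes only parentheses and bare atoms (no strings, brackets, or reader macros), so it is nowhere near a real Clojure reader, and the name `read_sexp` is made up for illustration. The point is just that the nested structure falls straight out of the parens:

```python
import re
from typing import Any, List

# Tokens are either a single paren or a run of non-space, non-paren characters.
TOKEN = re.compile(r"[()]|[^\s()]+")

def read_sexp(text: str) -> Any:
    """Parse one toy s-expression into nested Python lists of atom strings."""
    tokens: List[str] = TOKEN.findall(text)
    pos = 0

    def read() -> Any:
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while True:
                if pos >= len(tokens):
                    raise SyntaxError("missing ')'")
                if tokens[pos] == ")":
                    pos += 1
                    return items
                items.append(read())
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        return tok  # a bare atom, kept as a string

    result = read()
    if pos != len(tokens):
        raise SyntaxError("trailing input after first expression")
    return result

tree = read_sexp("(define (add a b) (+ a b))")
print(tree)
```

The whole "parser" is one recursive function with no precedence rules or statement grammar, which is roughly what "the AST is the source" means in practice.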


Just last night I was going down the rabbit hole of "what's the best programming language to use for vibe coding." I came to a short list of:

a) Typed Racket

b) OCaml

c) Julia

I would love to see those three added to your benchmarks. And Mistral Medium 3.5 added to the LLM list, please.


Thanks for the recs, we will look into adding some of these, maybe OCaml for variety. I'm not familiar with Racket.

Mistral Medium 3.5 is on there, but you will have to scroll down pretty far to find it (does not perform well): https://gertlabs.com/rankings?mode=oneshot_coding


Racket is a variety of Scheme that grew up as a teaching language, but now also has a few other notable niches as well.

Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".

Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.

I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)

PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.

When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.


Thanks for that, I hadn't scrolled down far enough.

Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:

a) Mistral beats GPT in Java and C++

b) It's close for Rust

c) GPT-5.5 easily wins for Go, JavaScript, Python and TypeScript

Model choice really does appear to be language dependent (assuming I'm reading the results correctly).


The deeper you go into the filters (single models, cross correlated by specific languages), the smaller your sample sizes. That's a known limitation. Tbh I doubt Mistral is better than GPT 5.5 at programming in any specific language; the comparison probably hit a few lower-quality generations by GPT 5.5 by chance (but I could be wrong! We're always adding more samples so data improves over time. We always prioritize largest sample counts for near-frontier models first).


What's going on with Qwen3.6 27b? Filtered to Python it comes out at the top of the list, which seems... well, unlikely.


While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.

The Qwen3.6 models have memorized some common games. For example, if you ask it to create an index.html with a snake game, it will generate almost the same high quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.


The more filters applied (one-shot coding only, Python only), the more variation you can expect from fewer samples -- that being said, it really is a great model so it's probably not too far above where it would end up with infinite samples.
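
As a rough illustration of how much fewer samples widen the uncertainty, here is a minimal Python sketch using the standard Wilson score interval for a success rate. The sample sizes and the 25% rate are hypothetical and not taken from the benchmark data, and `wilson_interval` is just an illustrative helper name:

```python
import math
from typing import Tuple

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Same observed 25% success rate at three hypothetical sample sizes.
for n in (8, 40, 400):
    lo, hi = wilson_interval(n // 4, n)
    print(f"n={n:3d}: observed 25%, plausible range {lo:.2f} to {hi:.2f}")
```

With only a handful of samples behind a filtered cell, the plausible range is wide enough that rank flips between models are not surprising.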


Qwen3.6 27b is a really strong model.


Yeah but that strong?


Yes, that strong. It's only lacking in context length, but it's not that small there, and it gets caught in circles more often than, say, a 1T parameter model does.

That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two that can do agentic programming at a reasonable enough tokens per second.


> it gets caught in circles more often than say a 1t parameter model does.

I've found that the Q5+ quants are less loopy than Q4. Still not perfect, but noticeably better.

> reasonable enough tokens per second

The speed has been amazing. I've been running the recent llama.cpp MTP branch with an uncensored variant of Qwen3.6-35B-A3B on my RTX 3090 at over 170 tokens per second, and it was able to turn a buffer overflow into a reliable shell exploit in just a few seconds (with reasoning disabled). Still a bit loopy though. Hopefully, the Qwen team will pay more attention to those looping issues. It feels like their models are especially susceptible.


Is that on a single 3090? I need to change my settings it sounds like


Yes, single RTX 3090 with this model https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-h... following these https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF instructions (should add "-j 8" to last cmake command for parallel build) and llama-server with --reasoning off

Note that the MTP PR https://github.com/ggml-org/llama.cpp/pull/22673 is still under development, so things might be broken.


Those are some fine languages, but how did you pick them? What was the criterion?


The initial criteria were strongly typed and functional first. Using an LLM for answers, of course, that returned me a list that looked like:

- Haskell

- OCaml

- F#

- Scala

- Gleam

- Purescript

- Grain

- Idris

Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript etc).

Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.

Next I started filtering the list based on additional criteria: not wanting a JS compilation target, performance, size of package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).

There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.

Julia pretty much remains in there, even though it doesn't really meet the initial strongly typed criterion, based on my familiarity, the ecosystem (especially for AI/ML tasks), and performance factors.

I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.


I've noticed that with Clojure(Script), unless you specifically instruct them to keep nesting levels low, they can hit a point where they make a paren placement error and can't debug their way out of it. In my case, though, one model made the error and then couldn't find what it had done, while a second model that I switched to was able to identify it and back it out. So I suspect this is a transient weakness in today's models, not something fundamental.


I don’t know, I think it might be fundamental. They think in tokens, and )))))))) might look equivalent to ))))))). Just like the strawberry problem.

The Calva Backseat Driver extension even includes a specific paren balancer for this reason, and it works quite well.
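
For a sense of what a balance check involves, here is a minimal Python sketch. It is not Calva's implementation, just an illustration under simplifying assumptions: it tracks `()`, `[]`, and `{}`, skips double-quoted strings, `;` line comments, and backslash character literals, and ignores everything else about Clojure syntax. `check_balance` and `position` are made-up names:

```python
from typing import List, Optional, Tuple

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def position(src: str, index: int) -> Tuple[int, int]:
    """Convert a character index into a 1-based (line, column) pair."""
    line = src.count("\n", 0, index) + 1
    col = index - src.rfind("\n", 0, index)
    return line, col

def check_balance(src: str) -> Optional[str]:
    """Return None if delimiters balance, else a description of the first problem."""
    stack: List[int] = []  # indices of delimiters that are still open
    i, n = 0, len(src)
    in_string = False
    while i < n:
        ch = src[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character inside the string
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == ";":
            # Line comment: advance to the newline (or end of input).
            while i < n and src[i] != "\n":
                i += 1
        elif ch == "\\":
            i += 1  # character literal such as \( : skip the next character
        elif ch in PAIRS:
            stack.append(i)
        elif ch in CLOSERS:
            line, col = position(src, i)
            if not stack:
                return f"unexpected '{ch}' at {line}:{col}"
            j = stack.pop()
            if PAIRS[src[j]] != ch:
                oline, ocol = position(src, j)
                return f"'{ch}' at {line}:{col} does not close '{src[j]}' opened at {oline}:{ocol}"
        i += 1
    if stack:
        oline, ocol = position(src, stack[-1])
        return f"{len(stack)} unclosed delimiter(s); innermost '{src[stack[-1]]}' opened at {oline}:{ocol}"
    return None

print(check_balance("(defn f [x] (+ x 1)"))
```

Feeding a message like this back to the model gives it a concrete location to fix instead of asking it to count closing parens itself.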


It's fundamental in that it's harder because there's less information per token. But we know it's not impossible because they can get nesting right at all, it's just a question of where the boundary is today. And if different models have different crapping-out points, then there's a gradient there and future models can do better.

In token terms it's more like the fingers problem than the strawberry problem. ")" is a single token, but the model gets confused by several repeats of the same thing.


It's a bit of a pitiful way to fail. I wonder if diffusion models could handle parenthesis matching better. And I wonder if you could rig up tools for structural editing like with paredit.


It's one of the drawbacks of having quite so little syntax. There's just less to grab hold of.


> can't debug their way out of it

Whoa, it seems you're using LLM to generate Clojure code like you'd do it with any other "static" PL. Give it a live REPL - it works wondrously.


That's because you are holding it wrong. Just replace the ( with rs, like in strawberry.


I just did a side-by-side with Claude Code Python vs. Raku for DSL use ... https://slangify.org if you are interested.



