This doesn't have API access yet, but OpenAI seem to approve of the Codex API ba...

DrProtic · 2026-04-23T19:28:33 1776972513

That pelican you posted yesterday from a local model looks nicer than this one.

Edit: this one has crossed legs lol

BeetleB · 2026-04-23T19:38:14 1776973094

It really needs to pee.

stingraycharles · 2026-04-24T02:03:40 1776996220

OpenAI hired the guy behind OpenClaw, so it makes sense that they’re more lenient towards its usage.

thierrydamiba · 2026-04-24T12:39:38 1777034378

They basically bought OpenClaw right?

takethebus · 2026-04-24T16:09:43 1777046983

I believe the technical term is "acquihire"

GistNoesis · 2026-04-23T20:34:14 1776976454

Isn't it awful? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways?

jetrink · 2026-04-23T20:52:18 1776977538

I feel like if I attempted this, the bike frame would look fine and everything else would be completely unrecognizable. After all, a basic bike frame is just straight lines arranged in a fairly simple shape. It's really surprising that models find it so difficult, but they can make a pelican with panache.

nlawalker · 2026-04-23T21:03:32 1776978212

> a fairly simple shape

Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...

necubi · 2026-04-23T21:02:59 1776978179

Humans are also famously bad at drawing bicycles from memory https://www.gianlucagimini.it/portfolio-item/velocipedia/

billywhizz · 2026-04-23T23:23:29 1776986609

why do you find it surprising? these models have no actual understanding of anything, never mind the physical properties and capabilities of a bicycle.

rimliu · 2026-04-24T07:07:43 1777014463

Sad to see this downvoted. So many people think that LLMs have understanding?

fragmede · 2026-04-23T21:02:47 1776978167

My question is, as a human, how well would you or I do under the same conditions? Which is to say, I could do a much better job in inkscape with Google images to back me up, but if I was blindly shitting vectors into an XML file that I can't render to see the results of, I'm not even going to get the triangles for the frame to line up, so this pelican is very impressive!

simonw · 2026-04-23T20:39:37 1776976777

Yeah, the bike frame is the thing I always look at first - it's still reasonably rare for a model to draw that correctly, although Qwen 3.6 and Gemini Pro 3.1 do that well now.

loa_in_ · 2026-04-23T20:51:40 1776977500

The distinction is that it's not drawing. It's generating an SVG document containing descriptors of the shapes.
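To illustrate what "descriptors of the shapes" means in practice, here is a minimal sketch that builds a rough bicycle as SVG text. The coordinates and the `bike_frame_svg` name are made up for illustration and are not taken from any model's output. Nothing is rendered at any point; the result is just a string of XML.

```python
import xml.etree.ElementTree as ET


def bike_frame_svg() -> str:
    """Return an SVG document (as text) describing a rough bicycle."""
    svg = ET.Element(
        "svg",
        xmlns="http://www.w3.org/2000/svg",
        viewBox="0 0 200 120",
    )

    # Two wheels, described only by centre and radius.
    for cx in (50, 150):
        ET.SubElement(
            svg, "circle",
            cx=str(cx), cy="80", r="30", fill="none", stroke="black",
        )

    # Frame tubes as (x1, y1, x2, y2). Illustrative coordinates only.
    tubes = [
        (50, 80, 90, 80),    # chainstay: rear axle to bottom bracket
        (90, 80, 80, 35),    # seat tube
        (80, 35, 50, 80),    # seat stay
        (80, 35, 135, 35),   # top tube
        (90, 80, 138, 45),   # down tube
        (135, 35, 138, 45),  # head tube
        (138, 45, 150, 80),  # fork: head tube to front axle
    ]
    for x1, y1, x2, y2 in tubes:
        ET.SubElement(
            svg, "line",
            x1=str(x1), y1=str(y1), x2=str(x2), y2=str(y2), stroke="black",
        )

    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    print(bike_frame_svg())
```

Whether the tubes actually meet at sensible points can only be checked by working through the numbers or by rendering the file, which is the step a one-shot model never gets.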

postalcoder · 2026-04-23T19:42:55 1776973375

I made pelicans at different thinking efforts:

https://hcker.news/pelican-low.svg

https://hcker.news/pelican-medium.svg

https://hcker.news/pelican-high.svg

https://hcker.news/pelican-xhigh.svg

Someone needs to make a pelican arena, I have no idea if these are considered good or not.

deflator · 2026-04-23T19:46:07 1776973567

They are not good, and they seem to get worse as you increase effort. Weird.

postalcoder · 2026-04-23T19:51:09 1776973869

Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.

throw310822 · 2026-04-23T19:58:06 1776974286

No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s

seanw444 · 2026-04-23T19:58:24 1776974304

Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?

simonw · 2026-04-23T20:13:05 1776975185

I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.

jimbokun · 2026-04-23T20:43:56 1776977036

What it has going for it is human interpretability.

Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.

billywhizz · 2026-04-24T17:45:26 1777052726

how can you say "it ended up being a surprisingly good measure of the quality of the model for other tasks" and also "It should not be treated as a serious benchmark" in the same comment?

if it is indeed a good measure of the quality of the model (hint: it's not) then, logically, it should be taken seriously.

this is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes - whether you like it or not simon - that is what you are now) are all too capable of.

simonw · 2026-04-24T18:46:45 1777056405

I genuinely don't see how those two statements conflict with each other.

Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.

If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.

billywhizz · 2026-04-24T19:38:37 1777059517

"some value" is not the same as "a surprisingly good measure of the quality of the model for other tasks".

doublethink does not mean holding two conflicting ideas in your head at once. it means holding two logically inconsistent positions/beliefs at the same time.

redox99 · 2026-04-23T20:05:36 1776974736

It all began with a Microsoft researcher showing a unicorn drawn in TikZ using GPT-4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.

Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put into this task. It's not a showcase of emergent properties.

Gander5739 · 2026-04-23T20:04:09 1776974649

https://simonwillison.net/2025/Jun/6/six-months-in-llms/

CamperBob2 · 2026-04-23T20:06:03 1776974763

It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.

It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.

If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.

bravoetch · 2026-04-23T20:36:57 1776976617

I tried getting it to generate OpenSCAD models, which seems much harder. Not had much joy yet with results.

a96 · 2026-04-24T07:54:27 1777017267

G-code and ASCII art are also text formats, but seem to be beyond most if not all models.

(There are some that generate 3d models specifically, more in the image generation family than chatbot family.)

lexarflash8g · 2026-04-23T23:02:57 1776985377

None of them have the pelican's feet placed properly on the pedals -- or the pedals are misrepresented. Cool art style but not physically accurate.

a96 · 2026-04-24T07:52:47 1777017167

I'm not sure a physically accurate pelican would reach two pedals on a common bicycle. Maybe a model can solve that problem one day.

lostmsu · 2026-04-24T00:40:37 1776991237

https://pelicans.borg.games/

droidjj · 2026-04-23T19:31:52 1776972712

It's... like no pelican I've ever seen before.

hagbard_c · 2026-04-23T23:37:43 1776987463

You've never seen pelicans riding bicycles either so maybe these are just representations of those specific subgroups of pelicans which are capable of riding them. Normal pelicans would not feel the need to ride bikes since they can fly, these special pelicans mostly seem to lack the equipment needed to do that which might be part of the reason they evolved to ride two-wheeled pedal-propelled vehicles.

matt3210 · 2026-04-24T04:17:29 1777004249

The pelican doesn’t really matter anymore since models are tuned for it knowing people will ask.

simonw · 2026-04-24T05:07:20 1777007240

They suck at tuning for it.

XCSme · 2026-04-23T19:38:46 1776973126

Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.

simonw · 2026-04-23T20:11:30 1776975090

Apparently it's fine: https://twitter.com/romainhuet/status/2038699202834841962

deflator · 2026-04-23T19:42:29 1776973349

Hmm. Any idea why it's so much worse than the other ones you have posted lately? Even the open weight local models were much better, like the Qwen one you posted yesterday.

simonw · 2026-04-23T20:11:25 1776975085

The xhigh one was better, but clearly OpenAI have not been focusing their training efforts on SVG illustrations of animals riding modes of transport!

irthomasthomas · 2026-04-23T20:18:50 1776975530

It beats opus-4.7 but looks like open models actually have the lead here.

Schlagbohrer · 2026-04-23T21:52:55 1776981175

That's amazing that the default did that much in just 39 "reasoning tokens" (no idea what a reasoning token is but that's still shockingly few tokens)

erdaniels · 2026-04-23T22:11:37 1776982297

If you don't know what a reasoning token is, then how can 39 be considered shockingly few?

Culonavirus · 2026-04-24T00:08:55 1776989335

It's less than 67, duh.

tclancy · 2026-04-24T01:38:48 1776994728

Not during peak hours.

mannanj · 2026-04-24T14:30:21 1777041021

Is OpenAI actually acting open for once here and allowing use of their model via a subscription, in contrast to Anthropic banning use in OpenClaw?

simonw · 2026-04-24T15:56:56 1777046216

That's what they said on Twitter.

noonething · 2026-04-23T21:48:22 1776980902

Thank you for doing all this. It's appreciated.

i_love_retros · 2026-04-24T02:00:17 1776996017

You do realise they are doing it for self promotion right?

simonw · 2026-04-24T02:33:24 1776998004

I mean, yeah. "Person who spends time publishing content online is doing it for self promotion" doesn't seem particularly notable to me. 24 years of self promotion and counting!

i_love_retros · 2026-04-24T12:28:29 1777033709

Dude it comes across, maybe only to me, as a bit shameless. Or maybe it's just that there are so many people lapping it up like you're doing a public service that I find tedious. I wish hackernews had a block feature but alas it doesn't. Maybe I'll vibecode a browser extension.

fc417fc802 · 2026-04-24T07:10:17 1777014617

I am always outraged when youtube creators ask me to like and subscribe. /s

i_love_retros · 2026-04-24T12:47:44 1777034864

Not the same at all. For that to happen you would have to explicitly visit their channel (forgive incorrect terminology, I don't use youtube). If someone kept posting on hackernews asking you to subscribe I hope you wouldn't appreciate it. swillison is spamming a communal public feed with self promotional comments about vibe coding, quite obviously because they, like the rest of us, are panicking about not having a career in a few years.

simonw · 2026-04-24T13:36:26 1777037786

The more time I spend actually working with these tools the less I fear for my future career.

Building software remains really hard. Most people are not going to be able to produce production quality software systems, no matter how good the AI tooling gets.

fc417fc802 · 2026-04-24T19:32:44 1777059164

Conversely, if the models ever make it to the point where they can replace ~all developers we will presumably have achieved AGI or even ASI and all other jobs will also be eliminated more or less simultaneously. So at least we'll all be in good company (and there probably won't be much point to marketing yourself in that case).

fc417fc802 · 2026-04-24T19:27:31 1777058851

Forums traditionally included signature blocks at the end of messages. If someone linked his youtube channel there would that be objectionable? Assuming the preceding message was on point of course.

Posts on HN are analogous to videos on youtube. A channel is analogous to an HN user profile.

SkyBelow · 2026-04-23T20:15:55 1776975355

Wait, I thought we were onto racoons on e-scooters to avoid (some of) the issues with Goodhart's Law coming into play.

simonw · 2026-04-23T20:22:24 1776975744

I fall back to possums on e-scooters if the pelican looks too good to be true. These aren't good enough for me to suspect any fowl play.

zerop · 2026-04-24T08:57:43 1777021063

So pelican must have become the mandatory test case to pass for all model providers before launch.

andriy_koval · 2026-04-23T19:41:01 1776973261

What is your setup for drawing the pelican? Do you ask the model to check the generated image, find issues, and iterate on it, which would demonstrate the model's real abilities?

simonw · 2026-04-23T20:12:07 1776975127

It's generally one-shot-only - whatever comes out the first time is what I go with.

I've been contemplating a more fair version where each model gets 3-5 attempts and then can select which rendered image is "best".
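A minimal sketch of that "3-5 attempts, then pick the best" idea, assuming two hypothetical callables: `generate` (one model call returning SVG text) and `judge` (returns the index of the preferred candidate, for example by showing the model the rendered images). Neither name comes from a real library, and the stubs at the bottom only stand in for real model calls.

```python
from typing import Callable, List


def best_of_n(
    generate: Callable[[str], str],
    judge: Callable[[List[str]], int],
    prompt: str,
    attempts: int = 3,
) -> str:
    """Generate several candidates and return the one the judge selects."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    # Each attempt is an independent one-shot generation.
    candidates = [generate(prompt) for _ in range(attempts)]

    # The judge sees all candidates at once and returns an index.
    choice = judge(candidates)
    if not 0 <= choice < len(candidates):
        raise ValueError(f"judge returned invalid index {choice}")
    return candidates[choice]


if __name__ == "__main__":
    # Stub generator: pretends each attempt yields a slightly longer SVG.
    counter = {"n": 0}

    def fake_generate(prompt: str) -> str:
        counter["n"] += 1
        return "<svg>" + "<circle/>" * counter["n"] + "</svg>"

    # Stub judge: prefers the longest document (a placeholder heuristic).
    def fake_judge(candidates: List[str]) -> int:
        return max(range(len(candidates)), key=lambda i: len(candidates[i]))

    print(best_of_n(fake_generate, fake_judge, "pelican on a bicycle", attempts=3))
```

In a real setup the judge step would need each SVG rendered to an image first, which is the part that makes this more work than the one-shot version.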

irthomasthomas · 2026-04-23T20:19:59 1776975599

Try llm-consortium with --judging-method rank

andriy_koval · 2026-04-23T20:14:43 1776975283

I think it will make results way better and more representative of model abilities.

simonw · 2026-04-23T20:16:27 1776975387

It would... but the test is inherently silly, so I'm still not sure if it's worth me investing that extra effort in it.

gpm · 2026-04-23T20:27:17 1776976037

I for one delight in bicycles where neither wheel can turn!

It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.

Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.

lxgr · 2026-04-23T20:58:07 1776977887

> It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.

I feel like the main problem for the models is that they can't actually look at the visual output produced by their SVG and iterate. I'm almost willing to bet that if they could, they'd absolutely nail it at this point.

Imagine designing an SVG yourself without being able to ever look outside the XML editor!

gpm · 2026-04-23T21:03:04 1776978184

> Imagine designing an SVG yourself without being able to ever look outside the XML editor!

I honestly think I could do much better on the bicycle without looking at the output (with some assistance for SVG syntax which I definitely don't know), just as someone who rides them and generally knows what the parts are.

I'd do worse at the pelicans though.

singingtoday · 2026-04-24T01:00:03 1776992403

Thank you for continuing to post these! Very interesting benchmark.

rolymath · 2026-04-23T20:17:15 1776975435

Exciting. Another Pelican post.

refulgentis · 2026-04-23T20:33:16 1776976396

It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt and there's obvious ways to better it and it's not worth doing because it's not serious and if you say anything at all about the thread it's off-topic so you're doing exactly what you're complaining about and it's a personal attack from the fun police.

Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.

simonw · 2026-04-23T20:41:20 1776976880

See if you can spot what's interesting and unique about this one. I've been trying to put more than just a pelican in there, partly as a nod to people who are getting bored of them.

sjdv1982 · 2026-04-23T20:15:32 1776975332

At some point, OpenAI is going to cheat and hardcode a pelican on a bicycle into the model. 3D modelling has Suzanne and the teapot; LLMs will have the pelican.

dakolli · 2026-04-23T20:23:14 1776975794

You know they are 1000% training these models to draw pelicans; this hasn't been a valid benchmark for 6+ months.

simonw · 2026-04-23T20:41:58 1776976918

OpenAI must be very bad at training models to draw pelicans (and bicycles) then.

Legend2440 · 2026-04-23T20:59:07 1776977947

Skepticism is out of control these days; any time an LLM does something cool it must have been cheating.

dakolli · 2026-04-25T04:46:56 1777092416

they legitimately suck at everything they don't have concrete examples to copy from.