Yes, this is quite shocking. They could have just had o3 fact-check the slides an...

throwaway0123_5 · 2025-08-07T17:50:06 1754589006

I thought so too, but I gave it a screenshot with the prompt:

> good plot for my presentation?

and it didn't pick up on the issue. Part of its response was:

> Clear metric: Y-axis (“Accuracy (%), pass @1”) and numeric labels make the performance gaps explicit.

I think visual reasoning is still pretty far from text-only reasoning.

abirch · 2025-08-07T17:40:42 1754588442

o3 did fact-check the slides, and it fixed its lower score.