pona-a on Feb 12, 2025 | on: DeepScaleR: Surpassing O1-Preview with a 1.5B Mode...

The original model, aside from its programming mistakes, also misremembered the doubling formula. I hoped to see that fixed, which it was, and perhaps also a more general performance boost from recovering some of the distillation loss.