Why OpenAI's new model is such a big deal


I thought OpenAI's GPT-4o, the leading model at the time, would be perfectly suited to help. I asked it to write a short wedding poem, with the constraint that each letter could appear only a certain number of times, so that teams could reproduce the poem with the set of tiles provided. GPT-4o failed miserably. It repeatedly insisted that its poem satisfied the constraints when it didn't; only when asked to count after the fact did it tally the letters correctly, yet it kept delivering poems that broke the rules. Since we didn't have time to carefully hand-write the verses, we dropped the poem idea and instead asked guests to memorize a series of shapes made from colored tiles. (This ended up being a huge hit with our friends and family, who also competed in dodgeball, egg toss, and capture the flag.)
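The constraint itself is easy to check mechanically, which is what made the model's confident miscounting so frustrating. A minimal sketch of such a checker, using an invented tile budget (the article doesn't specify the real one, and `fits_tile_budget` is a hypothetical helper name):

```python
from collections import Counter

def fits_tile_budget(text: str, budget: dict[str, int]) -> bool:
    """Return True if every letter in `text` appears no more often than
    the tile budget allows (case-insensitive; non-letters are ignored)."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return all(budget.get(letter, 0) >= n for letter, n in counts.items())

# A toy budget: four tiles of each vowel, two of every other letter.
budget = {c: (4 if c in "aeiou" else 2) for c in "abcdefghijklmnopqrstuvwxyz"}

print(fits_tile_budget("To have and to hold", budget))  # prints True
print(fits_tile_budget("letter tiles", budget))         # prints False (three t's)
```

A verifier like this is exactly the "counting after the fact" step the model could perform correctly; the hard part it failed at was generating text that satisfies the budget in the first place.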

Last week, however, OpenAI released a new model called o1 (formerly known as “Strawberry” and before that as Q*) that beats GPT-4o by far for this kind of purpose.

Unlike previous models that were well-suited to language tasks like writing and editing, OpenAI o1 focuses on multi-step “thinking,” the kind of process required for higher math, coding, or other STEM-based questions. According to OpenAI, it uses a “chain of thought” technique. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach if the current one doesn't work,” the company wrote in a blog post on its website.
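OpenAI has not published the details of how o1 was trained to do this, but the quoted description matches the familiar chain-of-thought prompting idea: instead of asking for an answer directly, you ask the model to reason in steps, check each step, and backtrack when one fails. A minimal, purely illustrative sketch of the two prompt styles (the function names and wording are invented for illustration):

```python
def direct_prompt(question: str) -> str:
    # Ask for the answer alone -- the style that tripped up GPT-4o above.
    return f"{question}\nAnswer with just the result."

def chain_of_thought_prompt(question: str) -> str:
    # Ask the model to reason step by step, verify each step against the
    # constraints, and try another approach when a step fails -- the core
    # idea behind "chain of thought".
    return (
        f"{question}\n"
        "Think step by step. After each step, verify it against the "
        "constraints. If a step fails, try a different approach before "
        "giving your final answer."
    )

q = "Write a two-line poem using the letter 't' at most twice."
print(chain_of_thought_prompt(q))
```

The difference with o1, per Welsh's point later in the article, is that this kind of stepwise reasoning is built into the model rather than bolted on through prompting.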

OpenAI’s tests indicate a resounding success. The model ranks in the 89th percentile on questions from the competitive programming platform Codeforces and would rank among the top 500 high school students in the U.S. Math Olympiad, which covers geometry, number theory, and other mathematical topics. The model is also trained to answer doctoral-level questions in subjects ranging from astrophysics to organic chemistry.

On Math Olympiad questions, the new model is 83.3% accurate, compared to 13.4% for GPT-4o. On PhD-level questions, the average accuracy was 78%, compared to 69.7% for human experts and 56.1% for GPT-4o. (Given these successes, it's not surprising that the new model was pretty good at writing a poem for our wedding games, although it still wasn't perfect; it used more Ts and Ss than instructed.)

So why is this important? The majority of advances in LLMs to date have been language-based, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. Aside from getting many facts wrong, these LLMs have failed to demonstrate the skills needed to solve important problems in areas such as drug discovery, materials science, coding, or physics. OpenAI's o1 is one of the first signs that LLMs could soon become truly helpful companions to human researchers in these fields.

This is a big deal because it introduces chain-of-thought reasoning in an AI model to a broad audience, says Matt Welsh, AI researcher and founder of the LLM startup Fixie.

“The reasoning skills are built right into the model, you don't have to use separate tools to get similar results. I expect this will raise the bar for the performance of AI models,” says Welsh.
