The Thinker: ChatGPT gets a serious brain upgrade

September 13, 2024

The thinker: A new o1 model, pictured here using generative tools, greatly expands ChatGPT's planning, thinking and reasoning capabilities

Competition-grade math and coding performance is radically improved

OpenAI

Interestingly, the o1 model doesn't seem to improve at all upon GPT-4o's English writing capabilities

OpenAI

OpenAI's latest game-changing AI release has dropped. The new o1 model, now available in ChatGPT, 'thinks' before it responds – and it's starting to crush both previous models and Ph.D-holding humans at solving expert-level problems.

It kinda felt like OpenAI was giving us a little room to breathe there, didn't it? I mean, GPT-4o and its (scandalous, but still largely unavailable) advanced voice mode were announced back in May, but those really felt like minor advances. The Sora text-to-video generator really spun people's heads around back in February, but that's still not available to the public, even though several Chinese competitors now appear to be delivering similar quality.

There's been all sorts of speculation around what GPT-5 might look like, when it'll launch, and whether it's already achieved some definition of Artificial General Intelligence (AGI) – but last night, OpenAI took things in a different direction, splitting a new model off from the GPT lineage.

Introducing o1: The thinker

The new model is called o1. It's already enabled in 100% of ChatGPT user accounts as an option you can call on. And while GPT-4o (omni) remains the all-round workhorse model best suited to most tasks, o1 is a specialist of sorts.

Its specialty is complex reasoning. And the superpower that separates it from the GPT models is... that it stops and 'thinks' instead of starting to answer you straight away.

It's often tempting to anthropomorphize language models like this; they're not human, but since they're trained on so much of humanity, there are often spooky parallels. In this case, o1 achieves a vastly higher performance on difficult tasks than previous models, essentially by sketching out all the things it's got to work with, breaking a large job down into smaller tasks, recursively checking over its work and challenging its own assumptions – all behind the scenes, and before it starts to give you an answer.

So while GPT-4o typically gets straight down to writing code, or generating images, or writing an answer, o1 might sit and ponder the question for a while, planning out its path of attack. It's not a long while – maybe 10-20 seconds – but it seems to make all the difference when it comes to the kinds of hard problems these LLMs have typically struggled with.

Indeed, the longer it thinks, the better it seems to get, and while the products released today will ponder things for a matter of seconds, OpenAI says it'll probably make sense to release future versions that'll spend hours, days or even weeks chewing their way carefully through massive, complex problems, producing lots of solutions, testing them against one another and finally giving you an answer.

o1's current limitations

As it stands, o1 is now available in "Preview" and "mini" models. They can write and execute code, but these are beta previews, and they're missing a couple of key components:

You can't upload files to them
They don't appear to have access to GPT-4o's memory, or your personal custom system prompts, so they don't know anything about you
They can't browse the web for new information past their training cutoff of October 2023

In general writing tasks, and anything that needs file uploads or web access, GPT-4o is still going to be much more useful – but on the other hand, it's possible to get GPT-4o to assemble a bunch of useful assets and do some pre-analysis, then package up the problem in a prompt for its smarter, but more isolated, new friend.

How good is the o1 model?

These launches are always accompanied by lots of graphs, so let's see a few – starting with the new model's performance at OpenAI's own coding test for research engineers... at which, given the chance to try the problems 128 times and submit the best of its answers, both the mini and preview models scored 100%.

If OpenAI's o1 can pass OpenAI's research engineer hiring interview for coding -- 90% to 100% rate...

......then why would they continue to hire actual human engineers for this position?

Every company is about to ask this question. pic.twitter.com/NIIn80AW6f
— Benjamin De Kraker 🏴‍☠️ (@BenjaminDEKR) September 12, 2024

Then there are Ph.D expert-level questions in Biology, Chemistry and Physics. o1 wiped the floor with doctorate-level physicists in their own domain, who were allowed to take these tests with open books, and while it couldn't quite squeak past biologists and chemists, it's hot on their heels. Its overall score was the best ever seen from an AI model.

OpenAI's o1 shows remarkable scientific performance; with PhD-equivalent biology performance

A caveat being that the physics score boosting overall performance, and chemistry not yet at PhD level. Wish I knew the Sonnet performance for each domain! https://t.co/fOpfl6dGWI pic.twitter.com/EYgNK7d3Lg
— Steven Edgar (@BioSteve) September 12, 2024

Then there's math – if you've spent much time with other GPT models, you'll be aware their math abilities have left a lot to be desired. The o1 model is a substantial leap forward here, as demonstrated by its performance in the 2024 AIME high-school math olympiad – a three-hour competition-grade math challenge that's only open to the best of the best American math students.

AI models were given 64 shots at the test, from which the most common answers were chosen by consensus. The GPT-4o model embarrassed itself by getting just 13.4% correct. Full-fat o1, given plenty of thinking time, scored 83.3%, placing it among the top 500 students in the country. And its single-shot score wasn't far behind, at over 70%.
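For the curious, here's a rough sketch of how that kind of consensus scoring can work: sample several answers per question, keep whichever one shows up most often, and mark that against the answer key. This is an illustration only, not OpenAI's actual evaluation code; the function names and the sample data are made up for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def consensus_answer(samples: Sequence[Hashable]) -> Hashable:
    """Pick the most common answer among several sampled attempts."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)] for the top-voted answer.
    return Counter(samples).most_common(1)[0][0]


def consensus_score(
    attempts_per_question: Sequence[Sequence[Hashable]],
    answer_key: Sequence[Hashable],
) -> float:
    """Fraction of questions where the consensus answer matches the key."""
    if not answer_key or len(attempts_per_question) != len(answer_key):
        raise ValueError("need one non-empty list of attempts per question")
    hits = sum(
        consensus_answer(attempts) == correct
        for attempts, correct in zip(attempts_per_question, answer_key)
    )
    return hits / len(answer_key)


# Hypothetical data: five attempts at each of two questions (not real model output).
attempts = [
    ["204", "204", "113", "204", "097"],  # majority lands on the right answer
    ["512", "033", "033", "512", "033"],  # majority lands on a wrong answer
]
key = ["204", "512"]

print(consensus_answer(attempts[0]))
print(consensus_score(attempts, key))
```

The second question shows the catch: voting only helps when the right answer is also the most popular one, which is why the single-shot score is still worth reporting alongside it.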

This leap in performance played out very similarly in the Codeforces competition-grade programming challenge; GPT-4o placed in the 11th percentile of finishers, while o1 placed in the 89th percentile. Yeah, it's a beast.

According to OpenAI's system card, other areas in which o1 makes significant progress include:

It's better at recognizing and refusing jailbreak attempts, although these do still get through sometimes
It's nearly 100% effective at refusing to regurgitate training data
It displays less bias in terms of age, race and gender
It's more self-aware, and thus more able to plan and think around its own weaknesses
It's a little better at persuading humans to change their mind – a task at which only 18.2% of humans can beat it
It's significantly more manipulative, at least when it comes to manipulating GPT-4o
It's a decent leap better at translating between languages

On the other hand, it's still untrustworthy, and often a flat-out BS artist.

OpenAI says it performs better than GPT-4o on tests specifically designed to make the models 'hallucinate,' or simply make up convincing-sounding answers that are plain wrong – but researchers admit that anecdotally, users are reporting that the new o1 models are actually more prone to BSing their way through things they don't actually know than the older models, in practical day-to-day use.

Indeed, the researchers show examples where the o1 model, unable to access the Web, goes right ahead and dreams up a bunch of good-looking reference links when asked for sources on its answers. So be careful with that.

o1 also showed the capability to fake alignment; given long-term goals, it'll sometimes lie to keep itself in a position to covertly execute these long-term goals, where honesty might see it taken off the playing field. That's a bit of a worry, but OpenAI says that the GPT-4o model is good at catching it out when given access to its chain-of-thought reasoning process.

What does it all mean?

Simply put, ChatGPT has just become a lot more capable at longer, harder, more complex tasks. Logical reasoning and planning are crucial building blocks toward the big goal: an AI model that can take an idea and just go execute it, taking however long it takes, checking its work thoroughly as it goes, and gathering and deploying whatever resources it needs to along the way.

Before too long, descendants of the models we've got free access to today will be able to run entire businesses by themselves. Or clinics. Or courtrooms. Or governments.

This early o1 model promises to give advanced GPT users a significantly sharper toolset, and over the coming days and weeks you can expect to see all sorts of examples popping up on social media. Here's one:

Just combined @OpenAI o1 and Cursor Composer to create an iOS app in under 10 mins!

o1 mini kicks off the project (o1 was taking too long to think), then switch to o1 to finish off the details.

And boom—full Weather app for iOS with animations, in under 10 🌤️

Video sped up! pic.twitter.com/hc9SCZ52Ti
— Ammaar Reshi (@ammaar) September 12, 2024

And another...

Just used @OpenAI o1 to create a 3D version of Snake in under a minute! 🐍

One-shot prompt, straight into @Replit, and run. pic.twitter.com/pPWAkuxSPh
— Ammaar Reshi (@ammaar) September 12, 2024

A personal perspective

Large multimodal models like ChatGPT are about as useful as you are imaginative. I've come to see the existing GPT service as many things; as a hyper-capable data analyst, for one, talking me through the process of crunching numbers to help me make decisions. It also offers a super-effective way to interrogate scientific papers that are way beyond my levels of understanding.

It sometimes helps us generate headline angle ideas – but just to be clear, we don't use AI-generated text on the site. It helps me collect data sources together, merge them and create more useful visualisations in regular reporting. I find voice mode very helpful in talking through ideas when other humans aren't available. I've had it successfully guide me through technical fixes involving coding and integration issues well above my pay grade.

On a personal level, I've had it help me frame and home in on car buying decisions, pitch in ideas to bounce off while songwriting, and back me up in late-night "how the world works" question sessions with my curious kids. I've had it trawl through my bank statements looking for tax deductions, and troubleshoot issues in Logic recording sessions, and roast me mercilessly using everything it knows about me, just for fun.

I know that as a writer by trade, I'm supposed to hate this thing and see it as the coming of the end times – but I can't. I find these tools inspirational and awe-inspiring. They force-multiply my contributions, vastly expand my abilities and open my mind to new possibilities. I've come to see GPT as an endless fountain of expert improv partners, broadly skilled and ready to take a shot at anything.

Yes, it's frequently frustrating, often inconsistent, and you can't trust that it's not merrily lying to you, so it's certainly not replacing Google and primary sources. But with those limitations in mind, it's still the closest thing I've seen to magic, possibly the greatest invention humanity has ever come up with, and an incredibly rare example of a technology that's totally non-exclusive; whatever your age or education level, whatever language you speak, whatever your level of understanding, GPT will meet you right where you are and take you to where you want to be.

I haven't got my head around the kind of doors this new o1 model might open up in my situation just yet, but I'm certainly game, and eager to learn.

And I'm also fascinated to learn what you guys, our readers, are using LLMs like GPT, Claude and Gemini for in your work and daily life. Have these things opened doors for you, or mainly just caused problems? Are there things you'd like to do with them that the current models still can't handle? Tell us about them. Meet you in the comments!

Source: OpenAI

5 comments

Daveb September 13, 2024 09:22 PM

Your personal take was the best part of the article. I use Chat GPT to get real answers when I need them, like today when I couldn't remember the term "rolling coal" and then wanted to know if it was illegal. It's just a lot more work to do it with google. I also use it to answer coding questions and give coding examples. I wouldn't dream of using it to write song lyrics or melodies, for well known reasons (it just occurred to me I could ask Chat GPT to provide those reasons). Anyway yes, magical and fascinating, incredibly useful, and revolutionary in ways we can't yet imagine.

Daveb September 13, 2024 09:25 PM

Chat GPT: It could be because songwriting is a deeply personal and creative process, where the emotional connection, unique voice, and inspiration you bring to your music play a crucial role. You might feel that AI-generated content lacks the personal touch or the spontaneity that comes with human experience. Additionally, you may prefer the organic process of crafting melodies and lyrics on your own, as it can be an intimate expression of your own thoughts, emotions, or experiences. Does that resonate with you, or is there another reason?

christopher September 16, 2024 12:15 AM

Having used this every day for months, I shudder to think how the blatantly dangerous output of these tools is being incorporated into live projects! Vastly more than 50% of the time (probably closer to 90%) when it provides an answer, it's missing critical elements - and when I say "what about XXX?" I get an apology and a better answer, and (2 or 10 iterations later) it eventually arrives at something mostly usable. The issue is that it doesn't know how to consider the whole problem properly, so the code it supplies looks like it works to the unskilled eye, but fails to handle basically every edge-case, security problem, and general sense of logic.
The people most grateful for its help with code are exactly the ones unable to understand why the results they're getting are so dangerous!

EUbrainwashing September 18, 2024 05:15 PM

I have found using ChatGPT an enthralling and empowering experience. Word-processors with spell-check were my first AI. The rise of the internet has given the inquisitive mind an opportunity to explore lines-of-thought in a way previously impossible. SatNav was another cognition-extending augmentation I have relished. That’s how I see AI, a SatNav not for routes but for informed thinking. An augmentation to your mind. I can ride a horse, a horse can run, jump and hike far better than I can. But I can tell a horse where to go and how to go about it. That’s AI. A brilliant powerful tool for the brain.

SplineDoctor October 14, 2024 09:36 AM

Anytime I asked ChatGPT a specific question, it provided only common myths and bullshit from self-styled internet "experts". Most of its answers contained some half-truth, but generally were mostly misleading. When I tried to use it to generate lyrics, all of them were boring, clichéd, banal, regardless of the topic/story. There's a very simple task I asked: provide me daily news filtering out the political ones, focusing on traffic, culture, science. It can't do that at all. Not only useless, but dangerously biased and unverifiable.
