
GPT-4 becomes 30% more accurate when asked to critique itself

GPT-4 can significantly improve its performance by designing and executing tests to critique its own performance, as shown here with its results on the AlfWorld test

Even if the unlikely six-month moratorium on AI development goes ahead, it seems GPT-4 has the capability for huge leaps forward if it just takes a good hard look at itself. Researchers have had GPT critique its own work for a 30% performance boost.

"It’s not everyday that humans develop novel techniques to achieve state-of-the-art standards using decision-making processes once thought to be unique to human intelligence," wrote researchers Noah Shinn and Ashwin Gopinath. "But, that’s exactly what we did."

The "Reflexion" technique takes GPT-4's already-impressive ability to perform various tests, and introduces "a framework that allows AI agents to emulate human-like self-reflection and evaluate its performance." Effectively, it introduces extra steps in which GPT-4 designs tests to critique its own answers, looking for errors and missteps, then rewrites its solutions based on what it's found.

On the HumanEval coding test, GPT-4 went from 67% to 88% accuracy, an impressive leap, using self-reflective loops

The team used its technique against a few different performance tests. In the HumanEval test, which consists of 164 Python programming problems the model has never seen, GPT-4 scored a record 67%, but with the Reflexion technique, its score jumped to a very impressive 88%.

In the AlfWorld test, which challenges an AI's ability to make decisions and solve multi-step tasks by executing several different allowable actions in a variety of interactive environments, the Reflexion technique boosted GPT-4's performance from around 73% to a near-perfect 97%, failing on only 4 out of 134 tasks.

In another test called HotPotQA, the language model was given access to Wikipedia, and then given 100 out of a possible 13,000 question/answer pairs that "challenge agents to parse content and reason over several supporting documents." In this test, GPT-4 scored just 34% accuracy, but GPT-4 with Reflexion managed to do significantly better with 54%.

More and more often, the solution to AI problems appears to be more AI. In some ways, this feels a little like a generative adversarial network, in which two AIs hone each other's skills, one trying to generate images, for example, that can't be distinguished from "real" images, and the other trying to tell the fake ones from the real ones. But in this case, GPT is both the writer and the editor, working to improve its own output.

Very neat!

The paper is available on arXiv.

Source: Nano Thoughts via AI Explained

3 comments
Speden Spelit
Hopefully this will help with the problem of asking for a coding solution and it gives back code that won't even compile. If it could try to compile its proposed solution before giving it, then it should help it improve and give better answers.
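For Python specifically, a minimal version of that idea could be a syntax-check loop like the sketch below. `call_model` is again a hypothetical stand-in for a GPT-4 call, and the built-in `compile()` only catches syntax errors, not logic bugs.

```python
from typing import Optional


def call_model(prompt: str) -> str:
    """Hypothetical placeholder for a GPT-4 call; substitute your own client."""
    raise NotImplementedError


def generate_compilable_code(task: str, max_tries: int = 3) -> Optional[str]:
    """Ask for code, feed any SyntaxError back to the model, and retry a few times."""
    code = call_model(f"Write a Python solution for: {task}")
    for _ in range(max_tries):
        try:
            # compile() only checks that the source parses to bytecode, not that it is correct.
            compile(code, "<generated>", "exec")
            return code
        except SyntaxError as err:
            code = call_model(
                f"Task: {task}\n\nThis code does not compile ({err}):\n{code}\n\n"
                "Return a fixed version."
            )
    return None
```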
dave be
Yeah, most of the 'I made this with GPT' stuff is done with non-compiled, type-less languages.

Programming with a strongly typed, compiled language, you really get shown how poor the results are. These tools are good for doing the bulk work, but it takes someone who knows what they're doing (which, in this case, a compiler makes a lot more obvious) to see the problems in the designs the AIs come up with. It makes sense, as they're often training these things on 'all github' and suchlike, which has as much (or more) bad code as good. So it's learning the problems as its truth as much as the successes.

But the answers they come up with for uncompiled languages look good, are formatted right, and may even run. That doesn't mean the variables are the right type, the data is going where it should, or that the whole thing won't fall apart the moment after the vid we watched finished filming.
clay
It would be great to actually see some practical suggestions that we mortals can apply. Not everything is Python programming... but indeed, GPT-3/3.5/4 all suck in this area. It won't be taking over the world any time soon ...intentionally.