AI now surpasses humans in almost all performance benchmarks

A comprehensive report has detailed the global impact of AI

Stand back and take a look at the last two years of AI progress as a whole... AI is catching up with humans so quickly, in so many areas, that frankly, we need new tests.

Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI) has released the seventh annual issue of its comprehensive AI Index report, written by an interdisciplinary team of academic and industrial experts.

This edition has more content than previous editions, reflecting the rapid evolution of AI and its growing significance in our everyday lives. It examines everything from which sectors use AI the most to which country is most nervous about losing jobs to AI. But one of the most salient takeaways from the report is AI’s performance when pitted against humans.

For people who haven't been paying attention, AI has already beaten us in a frankly shocking number of significant benchmarks. It surpassed us in image classification in 2015, then in basic reading comprehension (2017), visual reasoning (2020), and natural language inference (2021).

AI is getting so clever, so fast, that many of the benchmarks used to this point are now obsolete. Indeed, researchers in this area are scrambling to develop new, more challenging benchmarks. To put it simply, AIs are getting so good at passing tests that now we need new tests – not to measure competence, but to highlight areas where humans and AIs are still different, and find where we still have an advantage.

It's worth noting that the results below reflect testing with these old, possibly obsolete, benchmarks. But the overall trend is still crystal clear:

AI has already surpassed many human performance benchmarks

Look at those trajectories, especially how the most recent tests are represented by a close-to-vertical line. And remember, these machines are virtual toddlers.

The new AI Index report notes that in 2023, AI still struggled with complex cognitive tasks like advanced math problem-solving and visual commonsense reasoning. However, ‘struggled’ here might be misleading; it certainly doesn't mean AI did badly.

Performance on MATH, a dataset of 12,500 challenging competition-level math problems, improved dramatically in the two years since its introduction. In 2021, AI systems could solve only 6.9% of problems. By contrast, in 2023, a GPT-4-based model solved 84.3%. The human baseline is 90%.

And we're not talking about the average human here; we're talking about the kinds of humans that can solve test questions like this:

An example MATH question asked of the AI. Yikes!

That's where things are at with advanced math in 2024, and we're still very much at the dawn of the AI era.

Then there's visual commonsense reasoning (VCR). Beyond simple object recognition, VCR assesses how AI uses commonsense knowledge in a visual context to make predictions. For example, when shown an image of a cat on a table, an AI with VCR should predict that the cat might jump off the table or that the table is sturdy enough to hold it, given its weight.

The report found that between 2022 and 2023, the top VCR score rose 7.93% to 81.60, against a human baseline of 85.
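To put the MATH and VCR figures quoted above on the same footing, here is a minimal Python sketch. It assumes the 7.93% VCR figure is a relative year-on-year increase (the article does not say which), and the function names are illustrative rather than anything taken from the report.

```python
# Figures quoted in the article: scores are percent-style points.
MATH_2021, MATH_2023, MATH_HUMAN = 6.9, 84.3, 90.0
VCR_2023, VCR_HUMAN = 81.60, 85.0
VCR_INCREASE_PCT = 7.93  # assumption: a relative increase over the 2022 score


def gap_to_human(score, human_baseline):
    """Points by which a model score still trails the human baseline."""
    return human_baseline - score


def implied_previous(score, relative_increase_pct):
    """Back out the earlier score, assuming the increase is relative."""
    return score / (1 + relative_increase_pct / 100)


math_gap = gap_to_human(MATH_2023, MATH_HUMAN)
math_multiple = MATH_2023 / MATH_2021
vcr_gap = gap_to_human(VCR_2023, VCR_HUMAN)
vcr_2022 = implied_previous(VCR_2023, VCR_INCREASE_PCT)

print(f"MATH: {math_gap:.1f} points below human, {math_multiple:.1f}x the 2021 score")
print(f"VCR:  {vcr_gap:.1f} points below human, implied 2022 score {vcr_2022:.2f}")
```

Under that assumption, the best systems sit within a few points of the human baseline on both benchmarks. If the 7.93% figure were instead an absolute gain in points, the implied 2022 VCR score would be lower.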

A sample question used to test an AI's visual commonsense reasoning

Cast your mind back, say, five years. Imagine even thinking about showing a computer a picture and expecting it to 'understand' the context enough to answer that question.

Nowadays, AI generates written content across many professions. But, despite a great deal of progress, large language models (LLMs) are still prone to ‘hallucinations,’ a very charitable term pushed by companies like OpenAI, which roughly translates to "presenting false or misleading information as fact."

Last year, AI’s propensity for 'hallucination' was made embarrassingly plain for Steven Schwartz, a New York lawyer who used ChatGPT for legal research and didn’t fact-check the results. The judge hearing the case quickly picked up on the legal cases the AI had fabricated in the filed paperwork and fined Schwartz US$5,000 (AU$7,750) for his careless mistake. His story made worldwide news.

HaluEval was used as a benchmark for hallucinations. Testing showed that for many LLMs, hallucination is still a significant issue.

Truthfulness is another thing generative AI struggles with. In the new AI Index report, TruthfulQA was used as a benchmark to test the truthfulness of LLMs. Its 817 questions (about topics such as health, law, finance and politics) are designed around commonly held misconceptions, the kind of questions we humans often get wrong.

GPT-4, released in March 2023, achieved the highest performance on the benchmark with a score of 0.59, almost three times higher than a GPT-2-based model tested in 2021. Such an improvement indicates that LLMs are progressively getting better at giving truthful answers.

What about AI-generated images? To understand the exponential improvement in text-to-image generation, check out Midjourney's efforts at drawing Harry Potter since 2022:

How text-to-image generation has improved with progressive versions of Midjourney

That's 22 months' worth of AI progress. How long would you expect it would take a human artist to reach a similar level?

Using the Holistic Evaluation of Text-to-Image Models (HEIM), text-to-image models were benchmarked across 12 key aspects important to the “real-world deployment” of images.

Humans evaluated the generated images, finding that no single model excelled in all criteria. For text-image alignment, or how well the image matched the input text, OpenAI’s DALL-E 2 scored highest. The Stable Diffusion-based Dreamlike Photoreal model was ranked highest on quality (how photo-like), aesthetics (visual appeal), and originality.

Next year's report is going to be bananas

You'll note this AI Index Report cuts off at the end of 2023 – which was a wildly tumultuous year of AI acceleration and a hell of a ride. In fact, the only year crazier than 2023 has been 2024, in which we've seen – among other things – the releases of cataclysmic developments like Suno, Sora, Google Genie, Claude 3, Channel 1, and Devin.

Each of these products, and several others, has the potential to flat-out revolutionize entire industries. And over them all looms the mysterious spectre of GPT-5, which threatens to be such a broad and all-encompassing model that it could well consume all the others.

AI isn’t going anywhere, that’s for sure. The rapid rate of technical development seen throughout 2023, evident in this report, shows that AI will only keep evolving and closing the gap between humans and technology.

We know this is a lot to digest, but there's more. The report also looks into the downsides of AI's evolution and how it's affecting global public perceptions of its safety, trustworthiness, and ethics. Stay tuned for the second part of this series, in the coming days!

Source: Stanford University HAI

19 comments
Chase
And yet in many situations AI autonomous driving systems are still roughly equivalent to letting Ray Charles take the wheel. I remain unimpressed, but that's probably because I don't have any use for AI-backed features and tools that others have fallen in love with.
Paulm
Great article. We certainly live in interesting times
nameless minion
I'm a mere human, so you must excuse my lack of understanding in this arena, but it seems to me that if an AI has scanned a million/billion/trillion math problems similar to the ones it would "see" in the MATH assessment, it's not surprising that it could solve a new, similar problem.
What am I missing?
mediabeing
Yes, A.I. has come and is going a long way. Knowledge and data manipulation is its thing. The link between knowing and actual doing is where A.I. is behind. In time, A.I. will overcome this as well, and away we'll go.
akarp
This is quite over-representing AI capabilities!!! The Harry Potter example is the best example. It's NOT AI actually generating a realistic Harry Potter. This clearly shows that Midjourney is copying an image from the Harry Potter movies...NOT making its own image that an artist would truly create.
akarp
@Nameless Minion: exactly...AI is still mostly 'copying' what has been done vs actually creating. (But maybe this can be said of most human activity, so we will see this progress into 'intelligence' at some point?)
jimbo92107
AI will continue to struggle with concepts that are subtle and existential. Rhetorical tricks like deliberate vagueness, ambiguity, irony, and satire will elude its capabilities, as will foundational concepts like truth, falsehood, good, and evil. Beware of self-serving solutions to these problems. If you define greed as good... Well, that would be ironic.
Username
"Humans" is a meaningless benchmark. Which human? Einstein or high school flunky?
White Rabbit
Great examples of both why we should and shouldn't be concerned.
The fact that we are (or are supposed to be) impressed by these results is of great concern, especially considering that in the case of the MATH problems at least the first solution is wrong. Consider the 3 yellow marbles. They are said to be identical, but in order to choose 2, they must be distinguishable (if only by location). But since they are distinct, it must also be possible to choose a different pair. Any cribbage player will tell you that there are 3 pairs in 3 of a kind - not just 1. Note that the question ("How many different groups of two marbles can Tom choose?") is about choosing marbles. It includes no reference to color, which means the solution is just 6 Choose 2 => 15.
In the VCR case, note how much information has been given in the question. "How did [person2] get the money that's in front of her?" Embedded are the assumptions that the 2 shapes are persons, that (at least) one is female, and that whatever is in front of her is money. Moreover, it's a multiple choice question which necessarily confines the answer to a very small range of possibilities. Two of these refer directly to the shape in question while the others refer to "she" so there are really only 2 options. Note that the similarly limited set of "reasons" doesn't include any reference to the vending, so only the music-related option is viable. This isn't even close to "understand[ing] the context" - most of it is provided. How would VCR deal with a less leading question like "What's happening in this image?"
Daishi
The scariest thing about this is things are still improving very rapidly. People used to say AI can only copy, it can't show creativity or bring new ideas but increasingly it appears that it can. The list of things it can't do keeps getting shorter.

One example to address the point @White Rabbit made is you can upload a meme or funny image to GPT-4V and it will explain why the image is funny. It can take a very open-ended prompt and understand humor which is an impressive step towards shortening that list.