AI & Humanoids

Next-level AI engine comes top in LLM speed showdown

The Groq Language Processing Unit Inference Engine should significantly speed up response times from AI chatbots
Groq

Groq says that GPUs are the "weakest link in the generative AI ecosystem," and has developed the Language Processing Unit "to deliver substantial performance, efficiency, and precision all in a simple design"
Groq

Responses to AI chat prompts not snappy enough? California-based generative AI company Groq has a super quick solution in its LPU Inference Engine, which has recently outperformed all contenders in public benchmarks.

Groq has developed a new type of chip to overcome compute density and memory bandwidth issues and boost the processing speed of computationally intensive applications like Large Language Models (LLMs), reducing "the amount of time per word calculated, allowing sequences of text to be generated much faster."

This Language Processing Unit is an integral part of the company's inference engine, which processes information and provides answers to queries from an end user, serving up as many tokens (or words) as possible for super quick responses.

Late last year, in-house testing "set a new performance bar" by achieving more than 300 tokens per second per user through the Llama-2 (70B) LLM from Meta AI. In January 2024, the company took part in its first public benchmarking – leaving all other cloud-based inference providers in its performance wake. Now it's emerged victorious against the top eight cloud providers in independent tests.

Groq reports that the axes of the Latency vs. Throughput chart from ArtificialAnalysis.ai had to be extended to plot the performance of the Language Processing Unit Inference Engine
Groq

"ArtificialAnalysis.ai has independently benchmarked Groq and its Llama 2 Chat (70B) API as achieving throughput of 241 tokens per second, more than double the speed of other hosting providers," said Micah Hill-Smith, co-creator at ArtificialAnalysis.ai. "Groq represents a step change in available speed, enabling new use cases for large language models."

The Groq LPU Inference Engine came out on top in measures such as total response time, throughput over time, throughput variance and latency vs. throughput – with the chart for that last category needing its axes extended to accommodate the results.

The Groq Language Processing Unit Inference Engine delivered 241 tokens per second, and took 0.8 seconds to deliver 100 tokens
Groq
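
To get a rough sense of what those two figures mean together, the total time to stream a reply can be modelled as a fixed startup latency plus the generation time implied by the throughput. The short Python sketch below does that arithmetic using the benchmark numbers quoted above; the latency-plus-throughput model and the example reply lengths are assumptions for illustration, not part of ArtificialAnalysis.ai's methodology.

```python
# Back-of-the-envelope model: total_time ~= startup_latency + tokens / throughput.
# The 241 tokens/s throughput and 0.8 s per 100 tokens are the benchmark figures
# quoted above; treating the difference as a fixed startup latency is an assumption.

THROUGHPUT_TPS = 241.0         # tokens per second (benchmark figure)
TIME_FOR_100_TOKENS_S = 0.8    # seconds to deliver 100 tokens (benchmark figure)

# Startup latency implied by the two published numbers (~0.39 s).
startup_latency_s = TIME_FOR_100_TOKENS_S - 100 / THROUGHPUT_TPS

def estimated_response_time(num_tokens: int) -> float:
    """Estimate wall-clock seconds to stream a reply of num_tokens tokens."""
    return startup_latency_s + num_tokens / THROUGHPUT_TPS

for n in (100, 300, 1000):     # example reply lengths (assumed)
    print(f"{n:5d} tokens -> ~{estimated_response_time(n):.2f} s")
```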

"Groq exists to eliminate the 'haves and have-nots' and to help everyone in the AI community thrive," said Groq CEO and founder, Jonathan Ross. "Inference is critical to achieving that goal because speed is what turns developers' ideas into business solutions and life-changing applications. It is incredibly rewarding to have a third party validate that the LPU Inference Engine is the fastest option for running Large Language Models and we are grateful to the folks at ArtificialAnalysis.ai for recognizing Groq as a real contender among AI accelerators."

You can try the company's LPU Inference Engine for yourself through the GroqChat interface, though the chatbot doesn't have access to the internet. Early access to the Groq API is also available to allow approved users to put the engine through its paces via Llama 2 (70B), Mistral and Falcon.
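
For readers with API access who want to check those throughput figures for themselves, a minimal timing script along the following lines is one way to do it. This is a sketch only: the endpoint URL, model identifier and response fields are assumptions modelled on common OpenAI-style chat APIs rather than details confirmed by Groq, so consult the official API documentation before using it.

```python
# Hypothetical sketch of timing a chat completion against the Groq API.
# The endpoint URL, model name and request/response shape are assumptions
# for illustration only - check Groq's API documentation for the real details.
import os
import time
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["GROQ_API_KEY"]                          # assumed env var

payload = {
    "model": "llama2-70b-4096",  # assumed model identifier
    "messages": [{"role": "user",
                  "content": "Explain what an LPU is in two sentences."}],
}

start = time.perf_counter()
resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"}, timeout=60)
resp.raise_for_status()
elapsed = time.perf_counter() - start

data = resp.json()
print(data["choices"][0]["message"]["content"])

completion_tokens = data.get("usage", {}).get("completion_tokens", 0)
if completion_tokens:
    print(f"{completion_tokens} tokens in {elapsed:.2f} s "
          f"(~{completion_tokens / elapsed:.0f} tokens/s, network overhead included)")
```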

Source: Groq

2 comments
Daishi
I follow this space pretty closely and I didn't even know who Groq was two weeks ago. Running LLMs (inference) requires a lot less precision than training them. There is a lot of cost and complexity to building high-performance computing clusters to train LLMs, but it looks like inference workloads could potentially be moved to something like Groq Language Processing Units (LPUs) for a fraction of existing costs. I am sure Groq is making some waves in the AI industry now.
Daishi
Groq was created by Jonathan Ross, who co-created the TPU (Tensor Processing Unit) as a 20% project at Google. Impressive resume, and it seems like they are likely onto something big. This benchmark focuses on Llama 2 (70B), which was released by Meta last year, but the same site did a similar breakdown of Mixtral 8x7B, which is better than Llama 2 in a few ways. With Mixtral, token throughput goes from 246.8/s to 428/s average. Blended token pricing goes from $0.72/million to $0.27/million. For context, OpenAI charges between $1/million tokens (GPT 3.5 turbo) and $20/million tokens (GPT 4 turbo). Nothing in the industry is beating GPT 4 turbo in overall capability yet, but Mixtral 8x7B is more capable than the other models (Llama 2 70B and GPT 3.5 turbo).