Microsoft's speech recognition system is now as good as a human

By Rich Haridy

August 22, 2017

For the first time, a speech recognition system has achieved accuracy on par with a human

olly18/Depositphotos

Speech system milestones over the past 45 years

Microsoft researchers have hit a milestone 25 years in the making. The company's conversational speech recognition system has finally reached an error rate of only 5.1 percent, putting it on par with the accuracy of professional human transcribers for the first time ever.

A year ago, Microsoft's speech and dialog research group refined its system to reach a 5.9 percent word error rate. This was generally considered to be the average human error rate, but further work by other researchers suggested that 5.1 percent was closer to the mark for humans professionally transcribing conversational speech.

For over 20 years, a collection of recorded phone conversations known as Switchboard has been used to test speech recognition systems for accuracy. This is done by tasking either humans or a machine with transcribing recorded telephone conversations between strangers on topics including politics and sport.
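The article does not spell out how a word error rate is calculated. As a rough illustration, the sketch below uses the standard textbook definition: the fewest word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The function name `word_error_rate` and the sample sentences are made up for this example, and real benchmark scoring typically normalizes transcripts first, a step omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: assumes whitespace tokenization and lowercasing,
    which is simpler than the normalization real benchmarks apply.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample: one substitution ("won" -> "one") and one
# deletion ("last") against a seven-word reference.
wer = word_error_rate(
    "the team won the game last night",
    "the team one the game night",
)
print(f"{wer:.1%}")  # 28.6%
```

On this definition, a 5.1 percent error rate means roughly one word in 20 is wrong, missing or added relative to the reference transcript.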

To reduce the system's error rate by about 12 percent from last year's benchmark results, the team incorporated a series of improvements into its neural net-based acoustic and language models. On top of general upgrades to all components of the system, the model's vocabulary size was increased from about 30,000 words to 165,000.

Most significantly, the researchers incorporated what they called "dialog session-based long-short-term memory." In simple terms, this means the new language model allows the system to use the entire preceding conversation as history when working out specific phrases. This allows the system to recognize whether a conversation is about sport, for example, and take that into account as it weighs up potential transcriptions for a phrase.

The team notes that there is still much work to do in the speech recognition field and this latest breakthrough doesn't cover more complex tasks, such as recognizing speech in loud environments or deciphering strongly accented speech.

"Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent," writes Microsoft Technical Fellow, Xuedong Huang. "Moving from recognizing to understanding speech is the next major frontier for speech technology."

Microsoft's speech recognition systems are currently used in services such as Cortana and Speech Translator. The paper detailing the latest version can be viewed here (PDF).

Source: Microsoft Research

3 comments

chase August 22, 2017 12:32 PM

It would be nice if the error rate surpassed that of humans by a considerable margin.
Then we might finally see a decently accurate realtime CC dialogue on screen.
I don't know if you've ever watched real time CC for news or any telecast, but it's pretty bad. Even subtitled movies are badly transposed let alone translated.
Something that works and works well with a error rate of .0001 at speed with a low delay factor would be welcomed by many.

Rann Xeroxx August 22, 2017 01:55 PM

I know Google has been getting better and better over the years as this is how I text.

guzmanchinky August 22, 2017 02:08 PM

I can't wait for the day when languages are obsolete, a translator that works in real time with perfect accuracy...
