Microsoft researchers have hit a milestone 25 years in the making. The company's conversational speech recognition system has finally reached an error rate of only 5.1 percent, putting it on par with the accuracy of professional human transcribers for the first time ever.
A year ago, Microsoft's speech and dialog research group refined its system to reach a 5.9 percent word error rate. This was generally considered to be the average human error rate, but further work by other researchers suggested that 5.1 percent was closer to the mark for humans professionally transcribing speech heard in a conversation.
For over 20 years, a collection of recorded phone conversations known as Switchboard has been used to test speech recognition systems for accuracy. This is done by tasking either humans or a machine with transcribing recorded telephone conversations between strangers on topics including politics and sport.
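For readers curious how a word error rate is arrived at, the Python sketch below shows the usual definition: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. It is a minimal illustration rather than the scoring tool used on Switchboard. The function name and sample sentences are made up for the example, and real evaluations also normalize the text before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (not from the Microsoft paper): words are split on
    whitespace and compared exactly, with no text normalization.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    # One substituted word out of six reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

On this definition, a 5.1 percent error rate means roughly one word in 20 is wrong, missing or added when compared with the reference transcript.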
To reduce the system's error rate by about 12 percent from last year's benchmark results, the team incorporated a series of improvements into its neural net-based acoustic and language models. On top of general upgrades to all components of the system, the model's vocabulary size was increased from about 30,000 words to 165,000.
Most significantly, the researchers incorporated what they called "dialog session-based long-short-term memory." In simple terms, this means the new language model allows the system to use the entire preceding conversation as history when trying to determine specific phrases. This allows the system to recognize if a conversation is about sport, for example, and take that into account as it weighs up potential transcriptions for a phrase.
The team notes that there is still much work to do in the speech recognition field and this latest breakthrough doesn't cover more complex tasks, such as recognizing speech in loud environments or deciphering strongly accented speech.
"Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent," writes Microsoft Technical Fellow, Xuedong Huang. "Moving from recognizing to understanding speech is the next major frontier for speech technology."
Microsoft's speech recognition systems are currently used in services such as Cortana and Speech Translator. A paper detailing the latest version is available as a PDF.
Source: Microsoft Research
Then we might finally see a decently accurate realtime CC dialogue on screen.
I don't know if you've ever watched real-time CC for news or any telecast, but it's pretty bad. Even subtitled movies are badly transposed, let alone translated.
Something that works, and works well, with an error rate of .0001 at speed and with a low delay factor would be welcomed by many.