Researchers at the Norwegian University of Science and Technology (NTNU) are combining two of the best-known approaches to automatic speech recognition to build a better, language-independent speech-to-text algorithm that can recognize the language being spoken in under a minute, transcribe languages on the brink of extinction, and bring the dream of ever-present voice-controlled electronics just a little bit closer.

The exponential yearly improvements in processing power we are seeing give hope that we are quickly moving toward superbly accurate and responsive speech recognition – and yet, things aren't quite that simple. Even though this technology is slowly making its way into our phones, tablets and personal computers, it'll still be some time before keyboards disappear from our digital lives altogether.

Achieving accurate, real-time speech recognition is no easy feat. Even assuming that the sound acquired by a device can be completely stripped of background noise (which isn't always the case), there is hardly a one-to-one correspondence between the waveform detected by a microphone and the phoneme being spoken. Different people speak the same language with different nuances – accents, lisps and other articulation defects. Other factors such as age, gender, health and education also play a big role in altering the sound that reaches the microphone.

In other words, faster processors alone are not enough, because we also need a robust plan of action to use all that number-crunching power the right way – with efficient, reliable computer algorithms that can see through the incredible variety of sounds that can come out of our mouths and accurately transcribe what we are saying.

The NTNU researchers are now pioneering an approach that, if it can be fully exploited, may lead to a big leap in the performance of speech-to-text applications. They demonstrated that the mechanics of human speech are fundamentally the same across all people and across all languages, and they are now training a computer to analyze the pressure of sound waves captured by the microphone to determine which parts of the speech organs were used to produce a phoneme.

Much of the most successful speech recognition software available today asks users to provide personal information about themselves, including age group and accent, before it even attempts to transcribe their speech for the first time. When creating a new profile, users are also often asked to read some text aloud to calibrate the software's parameters.

This is because speech recognition software often uses data supplied by users to continuously improve its accuracy. It often uses probabilistic tools – namely, Bayesian inference – to estimate the probability of a certain sound being spoken given the user's speech patterns that it has learned over time. This means the quality of the transcripts can improve noticeably once the program has collected a critical amount of data on the user. On the flip side, speech recognition may not be very accurate right after a new user profile has been created.
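To make the idea of Bayesian inference concrete, here is a minimal sketch of how learned user habits can shift the estimate for an ambiguous sound. It is an illustration only and not the code of any real product: the phonemes, the probabilities and the function name `posterior` are all assumptions chosen for the example.

```python
def posterior(prior, likelihood):
    """Bayes' rule: P(phoneme | sound) is proportional to
    P(sound | phoneme) * P(phoneme), normalized to sum to 1."""
    unnormalized = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalized.values())
    return {p: value / total for p, value in unnormalized.items()}


# Hypothetical numbers: how well the captured sound matches each phoneme.
likelihood = {"s": 0.2, "z": 0.6}

# A brand-new profile knows nothing about the user, so both are equally likely.
new_profile_prior = {"s": 0.5, "z": 0.5}

# After some use, the software has (hypothetically) learned that this
# user produces "s" far more often than "z" in this context.
trained_prior = {"s": 0.8, "z": 0.2}

before = posterior(new_profile_prior, likelihood)
after = posterior(trained_prior, likelihood)
print(before)
print(after)
```

With the uninformed prior the sound is attributed mostly to "z"; with the learned prior the same sound tips toward "s". This is the sense in which accuracy can improve as the software gathers data on a particular speaker.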

An alternative to the statistical approach described above is to have humans study sounds, words and sentence structure for a given language and deduce rules which are then implemented into the software. For instance, different phonemes show different resonant frequencies, and the typical ranges for these frequencies can be programmed into the software to help it detect the sound more accurately.
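A rule of this kind might look like the following sketch, which checks the first two resonant frequencies of a sound against hand-written ranges. The ranges in `VOWEL_RULES` are rough, illustrative values for three vowels and are assumptions made for this example, as are all the names; they are not taken from the NTNU system.

```python
# Hypothetical rule table: vowel -> ((F1 low, F1 high), (F2 low, F2 high)) in Hz.
# The numbers are illustrative placeholders, not calibrated measurements.
VOWEL_RULES = {
    "i": ((200, 400), (2000, 2600)),
    "a": ((600, 900), (900, 1400)),
    "u": ((200, 400), (600, 1100)),
}


def classify_vowel(f1, f2):
    """Return the single vowel whose ranges contain both resonant
    frequencies, or None if no rule (or more than one rule) matches."""
    matches = [
        vowel
        for vowel, ((lo1, hi1), (lo2, hi2)) in VOWEL_RULES.items()
        if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2
    ]
    return matches[0] if len(matches) == 1 else None


print(classify_vowel(270, 2290))
print(classify_vowel(730, 1090))
print(classify_vowel(500, 1500))
```

The last call falls outside every range and returns `None`, which shows the main weakness of purely hand-written rules: real speakers vary, and sounds that miss the programmed ranges go unrecognized unless a human widens the rules.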

The system developed at NTNU is a blend of the two approaches: it collects data to learn about the user's speech nuances and improve accuracy over time but, crucially, it also incorporates a rule-based approach grounded in phonetics – the study of the sounds of human speech.

Detecting the pressure of sound waves on the microphone could mean achieving higher accuracy than was previously possible. As an example, sounds can be classified as voiced (in which vocal cords vibrate) and voiceless (in which they do not). The analysis of the pressure of sound waves on the microphone can detect the vibration of the vocal cords directly rather than deducing it from the peak frequencies captured by the microphone.
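The article does not describe how the NTNU system performs this analysis, so the sketch below swaps in a different, textbook heuristic purely to illustrate the voiced/voiceless distinction: vocal-cord vibration makes a signal periodic, which shows up as a strong autocorrelation peak at the pitch period. The function names, the 0.5 threshold and the 60 to 400 Hz pitch range are all assumptions for the example.

```python
import math
import random


def autocorr_peak(frame, sample_rate, fmin=60, fmax=400):
    """Largest normalized autocorrelation over lags that correspond
    to plausible pitch frequencies between fmin and fmax (Hz)."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def is_voiced(frame, sample_rate, threshold=0.5):
    """Hypothetical rule: treat strongly periodic frames as voiced."""
    return autocorr_peak(frame, sample_rate, ) >= threshold


SAMPLE_RATE = 8000  # samples per second
N = 400             # 50 ms frame

# Synthetic stand-ins: a 120 Hz tone mimics vocal-cord vibration,
# while random noise mimics a voiceless hiss such as "s".
tone = [math.sin(2 * math.pi * 120 * n / SAMPLE_RATE) for n in range(N)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(N)]

print(is_voiced(tone, SAMPLE_RATE))
print(is_voiced(noise, SAMPLE_RATE))
```

Real speech is far messier than these synthetic signals, which is exactly why a method that observes the vibration more directly, as the researchers propose, could be more reliable than inferring it from frequency content.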

Because the anatomy of speech is the same across all humans, one of the strengths of the system is that it is completely language-independent. Unlike previous approaches, it can therefore be adapted to a new language with little extra work, opening the door to languages spoken by small communities for which commercial speech-to-text software isn't financially viable.

The team is now looking to develop a language-independent module that they can use to design competitive speech recognition products. Such software could also do very well at transcribing speech in more than one language because, the researchers say, it only takes the system 30 to 60 seconds to identify a given spoken language.