Fujitsu gives speech synthesis a realism boost

Fujitsu's newly-developed speech synthesis system can produce voices to suit different situations

Speech synthesis has come a long way from the days when computers sounded like a Dalek with a cleft palate, but there’s still plenty of room for improvement. Fujitsu Laboratories Ltd. in Kawasaki, Japan is working to move computers away from sounding monotone or perpetually distracted by developing a new speech synthesis system that can quickly produce high-quality voices tailored to individual environments and circumstances.

Speech synthesis is one of those technologies that we tend to think of as futuristic, but a moment’s reflection reveals it to be far more common than we think. It isn’t just something for having a chat with your smartphone’s digital assistant app. It’s a widespread technology that helps vision-impaired people read text, provides emergency information that a simple alarm bell cannot, reads out traffic reports and travel information at airports and railway stations, features in self-guided museum tours, allows us to interact with satnav systems, gives instructions to workers without their having to leave the job, and drives most people around the bend when they’re navigating phone menus.

The problem is that speech synthesis is a matter of tradeoffs. You can make it work well, or you can make it work fast, and sometimes neither alone is enough. Whatever method is used to turn text or other machine-readable input into speech, it ideally needs to provide naturalness and intelligibility. That is, it needs to sound as if it’s coming from a human being rather than a machine, and it has to be understood by the listener.

Usage scenarios for speech synthesis

Even if you manage all this properly, there’s still the sticking point that the result often sounds like a flat monotone or something that just isn’t right. With so much conveyed not by what is said but by how it is said, an entire dimension of meaning is lost. The synthetic voice may be perfectly understandable, yet lack a sense of reassurance or urgency. Or the programmer may have put some sort of emotion into the voice that is completely wrong for the situation. Who wants a cheery voice announcing that their bank account is overdrawn?

Speech synthesis can also go beyond putting the right emotional tone in a voice. Proper synthesis can produce speech that can be heard clearly in noisy environments or over long distances in open areas, or that is more likely to catch the listener’s attention. Achieving any of this, however, requires more than simple adjustment of speed, volume, and pitch. It also requires a way of generating realistic voices without large libraries of pre-recorded speech or time-consuming algorithms.

Instead of relying on large libraries and manipulating synthetic speech simply by changing its speed, pitch, and volume, Fujitsu took a more flexible approach. It broke synthesis down into the basic components of speech, then set parameters for those components, such as intonation and pauses, that can be quickly and easily manipulated. The result is an algorithm that can create speech in 1/30th the time of previous methods.
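To get a feel for what that parameter-driven idea might look like in practice, here is a minimal Python sketch. It is purely illustrative: the class names, parameter names, and values below are invented for this example and are not part of any Fujitsu API.

```python
# Hypothetical sketch of parameter-driven synthesis: phrases carry a small
# set of tunable prosody parameters instead of relying on a huge library
# of pre-recorded variants. All names and numbers are invented.
from dataclasses import dataclass, field

@dataclass
class ProsodyParams:
    """Per-phrase parameters an engine could expose for quick tuning."""
    speed: float = 1.0        # playback-rate multiplier
    pitch: float = 1.0        # fundamental-frequency multiplier
    volume: float = 1.0       # loudness multiplier
    intonation: str = "flat"  # e.g. "flat", "rising", "urgent"
    pause_after_ms: int = 0   # silence inserted after the phrase

@dataclass
class Phrase:
    text: str
    prosody: ProsodyParams = field(default_factory=ProsodyParams)

def build_alarm_message(text: str) -> list[Phrase]:
    """Split a message into phrases and attach 'urgent' prosody,
    rather than recording or synthesizing a whole new voice."""
    urgent = ProsodyParams(speed=1.15, pitch=1.2, volume=1.4,
                           intonation="urgent", pause_after_ms=250)
    return [Phrase(chunk.strip(), urgent)
            for chunk in text.split(",") if chunk.strip()]

if __name__ == "__main__":
    for phrase in build_alarm_message(
            "Warning, evacuate the floor, use the east stairs"):
        print(phrase)
```

The point of the sketch is the separation of concerns: the text stays fixed while a handful of parameters is swapped or adjusted, which is cheap compared with regenerating or re-recording whole utterances.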

Overview of the newly-developed speech synthesis system from Fujitsu

It’s a bit like the difference between writing in Chinese characters and in the Latin alphabet. Both can convey the same meaning, but the Latin alphabet builds it from a small, flexible set of symbols and needs a much less complex keyboard. Similarly, the Fujitsu method draws on a smaller library of speech sounds and uses an algorithm that employs machine learning to custom-tailor the voice output.
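One way to picture the "small library, tailored output" idea is blending a few stored prosody patterns to cover many situations, rather than keeping one recording per situation. The sketch below is an assumption-laden stand-in for whatever learned mapping Fujitsu actually uses; the library entries and weights are invented.

```python
# Hypothetical illustration: a tiny library of prosody anchors is blended
# to approximate a requested situation. A learned model could pick the
# weights; here they are supplied by hand. All values are invented.

LIBRARY = {
    # situation: (speed, pitch, volume) anchors
    "calm":           (0.95, 0.95, 0.9),
    "urgent":         (1.20, 1.25, 1.5),
    "public_address": (0.90, 1.05, 1.3),
}

def blend(weights: dict[str, float]) -> tuple[float, float, float]:
    """Weighted mix of library anchors; a stand-in for the learned model."""
    total = sum(weights.values())
    mixed = [0.0, 0.0, 0.0]
    for name, w in weights.items():
        for i, value in enumerate(LIBRARY[name]):
            mixed[i] += value * w / total
    return (mixed[0], mixed[1], mixed[2])

# A 'mostly urgent, slightly public-address' voice for a noisy airfield:
print(blend({"urgent": 0.7, "public_address": 0.3}))
```

Because new situations are covered by recombining a few anchors instead of adding recordings, the library stays small, which is what makes the approach plausible for portable devices.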

What all this means is that the Fujitsu system can adapt synthetic speech to particular circumstances. An alarm message can sound urgent, be designed to pierce a noisy factory floor, or be heard clearly across an airfield. The voice can also be made reassuring, matched to the image a company wants to present to the public, or customized so an avatar speaks the way its user prefers. Fujitsu says that since the system uses only a small number of recorded speech patterns, the technology can be adapted to portable devices, such as those used in medicine, or used to recreate a person’s voice after it’s lost to surgery or accident.

Fujitsu Laboratories says that it hopes to develop a practical version of its speech synthesis system this year.

Here is an example (WAV file) showing the range of the Fujitsu speech synthesizer.

Source: Fujitsu Laboratories

2 comments
Mel Tisdale
The main component that needs to be correct is that the time period - either the past or the non-past - that the verb relates to has to be clear from the context or intonation pattern. If that time period matches the tense - again, either the past or the non-past - then the verb can be taken at face value. If it does not match, however, then a much deeper analysis is required before the listener can determine the true meaning of the verb, cf. "I wish I did not know now what I did not know then" (Against the Wind - Bob Seger): two seemingly identical verbs with completely opposite meanings. If synthesizers don't get that right, the listener will have to analyse each verb for its intended meaning, and in the process the speech will sound strange.
This feature of the language is one of the main reasons learners believe a foreign language is spoken more quickly than their native tongue.
Daniel Yen
When it comes to natural-sounding artificial voices, how close are we getting to those Digitized Professionals sounding... well, natural?