A neural network has learned to replicate the human voice

Last year, the artificial intelligence company DeepMind shared details about its WaveNet project, a deep neural network trained to synthesize realistic human speech. An improved version of the technology has now been released, and it will serve as the basis of speech generation in Google Assistant, the company’s digital mobile assistant.

Voice-synthesizing systems (also known as text-to-speech, or TTS) are usually built on one of two main methods. The concatenative (or compilation) method constructs phrases by stitching together fragments of words and sounds pre-recorded by a voice actor. Its main drawback is that the sound library has to be re-recorded whenever anything is updated or changed.
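For illustration only, here is a minimal sketch of the concatenative idea in Python, assuming a hypothetical library of pre-recorded unit waveforms stored as NumPy arrays (the unit names and lengths are made up for the example):

```python
import numpy as np

SR = 16000  # assumed sample rate, Hz

# Hypothetical library of pre-recorded speech units; in a real system these
# would be cut from studio recordings made by a voice actor.
unit_library = {
    "he": np.random.randn(1600),   # placeholder audio, 0.1 s each
    "el": np.random.randn(1600),
    "lo": np.random.randn(1600),
}

def concatenate_units(unit_names, crossfade=160):
    """Stitch pre-recorded units together with a short linear crossfade."""
    out = unit_library[unit_names[0]].copy()
    fade = np.linspace(0.0, 1.0, crossfade)
    for name in unit_names[1:]:
        nxt = unit_library[name]
        # Blend the tail of the audio so far with the head of the next unit.
        out[-crossfade:] = out[-crossfade:] * (1.0 - fade) + nxt[:crossfade] * fade
        out = np.concatenate([out, nxt[crossfade:]])
    return out

audio = concatenate_units(["he", "el", "lo"])  # crude "hello"
```

The sketch also shows the drawback described above: every waveform in `unit_library` is fixed at recording time, so changing the voice or adding new sounds means re-recording the library.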

The other method, parametric TTS, generates the desired phrase from a set of acoustic parameters. Its disadvantage is that the result usually sounds unnatural, the so-called “robotic” voice.
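A rough illustration of the parametric idea, generating a buzzy vowel-like sound from a handful of hand-picked parameters (the pitch and formant values below are assumptions chosen for the sketch, not taken from any real TTS system):

```python
import numpy as np

def synthesize_vowel(f0=120.0, formants=(700.0, 1200.0, 2600.0),
                     duration=0.5, sr=16000):
    """Very crude parametric synthesis: a harmonic source shaped toward formant frequencies."""
    t = np.arange(int(duration * sr)) / sr
    # Source: a harmonic-rich signal at the fundamental frequency f0 (the "pitch").
    source = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 20))
    # Shift source energy toward each formant via amplitude modulation,
    # a very rough stand-in for a vocal-tract filter.
    signal = np.zeros_like(t)
    for f in formants:
        signal += np.sin(2 * np.pi * f * t) * source
    return signal / np.max(np.abs(signal))

audio = synthesize_vowel()  # a static, distinctly "robotic" vowel
```

Everything about the output is dictated by a few numbers, which is exactly why simple parametric systems tend to sound flat and mechanical.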

WaveNet, by contrast, produces the sound wave from scratch with a convolutional neural network in which audio is generated across many layers. To train it to synthesize “live” speech, the platform is fed a huge number of speech samples while it learns which audio signals sound realistic and which do not. This gives the voice synthesizer the ability to reproduce naturalistic intonation and even details such as lip smacks. Depending on which samples are passed through the system, it develops a distinct “accent”, which in the future can be used to create many different voices.
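A minimal sketch of the core ingredient, a stack of dilated causal 1-D convolutions that predicts audio one sample at a time; the layer sizes are arbitrary and the model is untrained, so this only illustrates the mechanism, not DeepMind’s actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    """Toy stack of dilated causal convolutions over 256-level (8-bit) audio samples."""
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Conv1d(256, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations]
        )
        self.out = nn.Conv1d(channels, 256, kernel_size=1)

    def forward(self, x_onehot):
        h = self.embed(x_onehot)
        for conv in self.layers:
            # Left-pad so each output position only sees past samples (causality).
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            h = h + torch.relu(conv(F.pad(h, (pad, 0))))   # residual connection
        return self.out(h)  # logits over the 256 possible values of the next sample

def generate(model, n_samples=160, seed=128):
    """Autoregressive sampling: each generated sample is fed back to predict the next."""
    samples = [seed]
    for _ in range(n_samples):
        x = F.one_hot(torch.tensor([samples]), 256).float().transpose(1, 2)
        logits = model(x)[:, :, -1]                        # prediction for the next sample
        samples.append(torch.multinomial(F.softmax(logits, dim=-1), 1).item())
    return samples

audio = generate(TinyWaveNet())  # untrained, so the result is just noise
```

The sample-by-sample loop in `generate` is also the source of the speed problem discussed below: every new audio sample requires another pass through the network.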

Sharp of tongue

Perhaps the biggest limitation of the WaveNet system was that it demanded an enormous amount of processing power, and even then it was slow: generating just 0.02 seconds of audio took about one second of computation.

After a year of work, DeepMind’s engineers found a way to optimize the system so that it can now produce one second of raw audio in just 50 milliseconds, 1,000 times faster than its original capability. They also increased the resolution of each audio sample from 8 bits to 16 bits, which improved its scores in tests with listeners. These advances opened the road for WaveNet to be integrated into consumer products such as Google Assistant.
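Back-of-envelope, using the figures quoted above (a sketch of the arithmetic only):

```python
# Seconds of audio produced per second of compute.
original = 0.02 / 1.0     # 0.02x real time: 50x slower than real time
improved = 1.0 / 0.050    # 20x real time: one second of audio in 50 ms

print(improved / original)  # speed-up factor: 1000.0
```

Going from 50 times slower than real time to 20 times faster than real time is what made interactive use in a product like Google Assistant practical.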

At the moment, WaveNet generates English and Japanese voices for Google Assistant and for every platform on which the digital assistant runs. Since the system produces a distinct kind of voice depending on the set of samples it was trained on, Google will most likely extend WaveNet to realistic speech synthesis in other languages, including their local dialects, in the near future.
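For readers who want to hear WaveNet output directly, Google also exposes WaveNet-based voices through its Cloud Text-to-Speech API; a minimal sketch, assuming the google-cloud-texttospeech Python client and one of its published WaveNet voice names:

```python
from google.cloud import texttospeech

# Requires Google Cloud credentials configured in the environment.
client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from WaveNet."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US",
                                            name="en-US-Wavenet-D"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16),
)

with open("wavenet_sample.wav", "wb") as f:
    f.write(response.audio_content)  # 16-bit PCM audio of the synthesized phrase
```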

Speech interfaces are becoming more and more common across a variety of platforms, but their distinctly unnatural sound puts off many potential users. DeepMind’s efforts to improve this technology will certainly help such voice systems spread more widely and will also improve the experience of using them.