Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one’s voice, caused by physical or neurological conditions, can result in a profound sense of loss, striking at the very heart of one’s identity. Speakers with degenerative neural diseases, such as amyotrophic lateral sclerosis (ALS), Parkinson’s disease, and multiple sclerosis, may experience a degradation of some of the unique characteristics of their voice over time. Some individuals are born with conditions, like muscular dystrophy, that affect the articulatory system and limit their ability to produce certain sounds. Profound deafness also impacts vocal and articulatory patterns due to the absence of auditory input and feedback. For these speakers, producing speech that matches the typical speech most listeners are used to hearing is a lifelong challenge.
In recent years, voice transfer (VT) technology has advanced considerably and has been integrated into text-to-speech (TTS), voice conversion (VC), and speech-to-speech translation models. For example, in our previous work, we built a VC model that converts atypical speech directly into a synthesized, predetermined typical voice that others can understand more easily. That approach improves intelligibility but does not give speakers their own voice. For many individuals with dysarthria, VT extends speech technologies to help them regain their original voice and potentially predict speech patterns they have lost.
A VT module can be designed for a given speaker using either a few-shot or a zero-shot approach. In few-shot VT, a sample of speech from a given speaker is used to adapt a pre-trained model to transfer or clone their voice. This approach typically produces high quality speech with high speaker-voice fidelity, depending on the amount and quality of the training samples. A more challenging approach is zero-shot VT, which requires no speaker-specific training. Instead, audio reference samples from a given speaker (e.g., 10 seconds) are fed to the system during generation, and the system transfers their voice into the synthesized output speech. Zero-shot systems vary significantly in quality and are not guaranteed to produce a voice with high fidelity to the reference. Few-shot approaches can be effective for speakers who once had typical speech and banked a set of high quality samples of their voice before their condition progressed (or a physical injury occurred). Zero-shot, on the other hand, is more appropriate for dysarthric speakers who have not banked sufficient samples of their voice or have never had a typical voice. Moreover, because a zero-shot system needs no per-speaker training, it can be easily scaled and deployed.
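The difference between the two approaches can be illustrated with a toy sketch. It is not our model: `toy_speaker_embedding`, `toy_synthesize`, `zero_shot_transfer`, and `few_shot_adapt` are hypothetical names, the "speaker encoder" is a simple energy profile, and the "synthesizer" is a single linear map. The only point it tries to make is where the speaker information enters: in zero-shot VT the reference audio is consumed at generation time and the model parameters stay fixed, whereas in few-shot VT the parameters themselves are updated on the speaker's samples.

```python
import numpy as np

EMBED_DIM = 8  # assumed embedding size, chosen only for illustration


def toy_speaker_embedding(reference_audio, dim=EMBED_DIM):
    """Hypothetical stand-in for a speaker encoder.

    Splits the waveform into `dim` chunks, takes the RMS energy of each,
    and L2-normalizes the result.
    """
    audio = np.asarray(reference_audio, dtype=np.float64)
    chunks = np.array_split(audio, dim)
    emb = np.array([np.sqrt(np.mean(c ** 2)) for c in chunks])
    norm = np.linalg.norm(emb)
    return emb / norm if norm > 0 else emb


def toy_synthesize(text, speaker_embedding, weights):
    """Hypothetical stand-in for a TTS decoder.

    Returns one feature vector per character, scaled by a linear
    projection of the speaker embedding.
    """
    char_feats = np.array([[ord(ch) / 128.0] for ch in text])  # (T, 1)
    voice = weights @ speaker_embedding                        # (dim,)
    return char_feats * voice                                  # (T, dim)


def zero_shot_transfer(text, reference_audio, weights):
    """Zero-shot: the reference is used only at generation time.

    No parameter of the model is changed.
    """
    emb = toy_speaker_embedding(reference_audio)
    return toy_synthesize(text, emb, weights)


def few_shot_adapt(weights, speaker_embedding, target_voice, steps=50, lr=0.1):
    """Few-shot: adapt a copy of the pre-trained weights to one speaker.

    Runs gradient descent on ||W e - t||^2, where e is the speaker
    embedding and t is a target voice vector derived from banked samples.
    """
    adapted = weights.copy()
    for _ in range(steps):
        residual = adapted @ speaker_embedding - target_voice
        adapted -= lr * 2.0 * np.outer(residual, speaker_embedding)
    return adapted
```

In this sketch, supporting a new speaker in the zero-shot path only means passing different reference audio to `zero_shot_transfer`, while the few-shot path needs a separate call to `few_shot_adapt` (and a separate set of weights) per speaker.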
In this blog post, we describe a zero-shot VT module that can be easily plugged into a state-of-the-art TTS system to restore the voices of input speakers. It can be used both when speakers have banked a small set of samples of their voice and when atypical speech is the only data available. We add this module to our TTS system and use it to restore the voices of speakers who banked their typical speech. We also show that the same model produces high quality speech with high fidelity voice preservation even when the input reference is atypical, which is useful for those who have not banked their voice or have never had typical speech. Finally, we demonstrate that such a module can transfer a voice across languages, that is, when the language of the input reference speech differs from the intended target language.