Zero-shot mono-to-binaural speech synthesis


Humans possess a remarkable ability to localize sound sources and perceive the surrounding environment through auditory cues alone. This sensory ability, known as spatial hearing, plays a critical role in numerous everyday tasks, including identifying speakers in crowded conversations and navigating complex environments. Hence, emulating a coherent sense of space via listening devices like headphones becomes paramount to creating truly immersive artificial experiences. Due to the lack of multi-channel and positional data for most acoustic and room conditions, the robust and low- or zero-resource synthesis of binaural audio from single-source, single-channel (mono) recordings is a crucial step towards advancing augmented reality (AR) and virtual reality (VR) technologies.

Conventional mono-to-binaural synthesis techniques rely on a digital signal processing (DSP) framework. Within this framework, the way sound scatters through the room to the listener's ears is formally described by the head-related transfer function (HRTF) and the room impulse response (RIR). These functions, along with the ambient noise, are modeled as linear time-invariant systems and must be measured in a meticulous process for each simulated room. DSP-based approaches are prevalent in commercial applications thanks to their established theoretical foundation and their ability to generate perceptually realistic audio experiences. However, their dependence on carefully measured, room- and listener-specific transfer functions makes them difficult to scale to arbitrary environments.
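As a rough illustration of this DSP framework, the sketch below renders a mono signal to two ears by convolving it with a pair of head-related impulse responses and an optional room impulse response, then adding ambient noise. The function name, inputs, and noise model are assumptions for illustration only, not a production pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize_dsp(mono, hrir_left, hrir_right, rir=None, noise_level=0.0):
    """Render a mono signal to two ears with linear time-invariant filters.

    `hrir_left` / `hrir_right` are measured head-related impulse responses and
    `rir` is an optional room impulse response; all are 1-D arrays at the same
    sample rate as `mono`. These are hypothetical inputs for illustration --
    real systems measure them carefully for each room and listener.
    """
    if rir is not None:
        mono = fftconvolve(mono, rir, mode="full")        # room reverberation
    left = fftconvolve(mono, hrir_left, mode="full")      # scattering to the left ear
    right = fftconvolve(mono, hrir_right, mode="full")    # scattering to the right ear

    # Pad to a common length before stacking into a (2, samples) array.
    length = max(len(left), len(right))
    left = np.pad(left, (0, length - len(left)))
    right = np.pad(right, (0, length - len(right)))
    binaural = np.stack([left, right], axis=0)

    if noise_level > 0.0:
        binaural = binaural + noise_level * np.random.randn(*binaural.shape)  # ambient noise
    return binaural
```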

Given these limitations of conventional approaches, using machine learning to synthesize binaural audio from monophonic sources is appealing. However, doing so with standard supervised learning models remains difficult, for two primary reasons: (1) position-annotated binaural audio datasets are scarce, and (2) real-world environments vary widely in room acoustics and background noise. Moreover, supervised models are prone to overfitting to the specific rooms, speaker characteristics, and languages seen during training, especially when the training dataset is small.

To address these limitations, we present ZeroBAS, the first zero-shot method for neural mono-to-binaural audio synthesis, which leverages geometric time warping, amplitude scaling, and a (monaural) denoising vocoder. Notably, it generates natural binaural audio that is perceptually on par with the output of existing supervised methods, despite never being trained on binaural data. We further present a novel dataset-building approach and a new dataset, TUT Mono-to-Binaural, derived from the location-annotated ambisonic recordings of speech events in the TUT Sound Events 2018 dataset. When evaluated on this out-of-distribution data, prior supervised methods exhibit degraded performance, while ZeroBAS continues to perform well.
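To give a concrete sense of the geometric time warping and amplitude scaling components, here is a minimal sketch that, under simplifying assumptions (a static source, free-field propagation, and a fixed speed of sound), delays and attenuates the mono signal per ear using only the source and ear positions. The function and its arguments are illustrative and are not the ZeroBAS implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second, a common room-temperature assumption

def geometric_warp_and_scale(mono, sample_rate, source_pos, left_ear_pos, right_ear_pos):
    """Crude two-channel rendering from source/ear geometry alone.

    Per ear: delay the mono signal by the propagation time (time warping) and
    attenuate it by the inverse of the distance (amplitude scaling). Positions
    are 3-D coordinates in metres and are hypothetical inputs; this is a sketch
    of the general idea, not the ZeroBAS implementation.
    """
    channels = []
    for ear_pos in (left_ear_pos, right_ear_pos):
        distance = np.linalg.norm(np.asarray(source_pos) - np.asarray(ear_pos))
        delay_samples = int(round(distance / SPEED_OF_SOUND * sample_rate))
        gain = 1.0 / max(distance, 1e-3)                  # inverse-distance attenuation
        delayed = np.concatenate([np.zeros(delay_samples), mono]) * gain
        channels.append(delayed)

    # Pad both channels to the same length before stacking into (2, samples).
    length = max(len(c) for c in channels)
    channels = [np.pad(c, (0, length - len(c))) for c in channels]
    return np.stack(channels, axis=0)
```

A coarse two-channel signal like this only captures interaural time and level differences; in a full system it would presumably be further refined, for example by the denoising vocoder mentioned above, to produce perceptually natural binaural audio.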
