BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for
Binaural Audio Synthesis
- URL: http://arxiv.org/abs/2205.14807v1
- Date: Mon, 30 May 2022 02:09:26 GMT
- Title: BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for
Binaural Audio Synthesis
- Authors: Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu
Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu
- Abstract summary: We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize the two parts respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
- Score: 129.86743102915986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Binaural audio plays a significant role in constructing immersive augmented
and virtual realities. As it is expensive to record binaural audio from the
real world, synthesizing them from mono audio has attracted increasing
attention. This synthesis process involves not only the basic physical warping
of the mono audio, but also room reverberations and head/ear related
filtrations, which, however, are difficult to accurately simulate in
traditional digital signal processing. In this paper, we formulate the
synthesis process from a different perspective by decomposing the binaural
audio into a common part that is shared by the left and right channels as well as
a specific part that differs in each channel. Accordingly, we propose
BinauralGrad, a novel two-stage framework equipped with diffusion models to
synthesize them respectively. Specifically, in the first stage, the common
information of the binaural audio is generated with a single-channel diffusion
model conditioned on the mono audio, based on which the binaural audio is
generated by a two-channel diffusion model in the second stage. Combining this
novel perspective of two-stage synthesis with advanced generative models (i.e.,
diffusion models), the proposed BinauralGrad is able to generate accurate
and high-fidelity binaural audio samples. Experimental results show that on a
benchmark dataset, BinauralGrad outperforms the existing baselines by a large
margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128
vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples are available
online.
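To make the two-stage formulation concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It assumes the common part is approximated by the channel average, and it uses a generic DDPM ancestral-sampling loop in which `stage1` and `stage2` stand in for the paper's single-channel and two-channel denoising networks (any callables that predict the added noise).

```python
import torch

def decompose(binaural: torch.Tensor):
    """Split a (2, T) binaural signal into a common part and per-channel
    residuals. Assumption: the common part is the channel average; under
    this sketch it would supply the stage-1 training target."""
    common = binaural.mean(dim=0, keepdim=True)   # (1, T)
    specific = binaural - common                  # (2, T)
    return common, specific

def ddpm_reverse(eps_model, shape, cond, betas):
    """Standard DDPM ancestral sampling; `eps_model(x, t, cond)` predicts
    the noise added at step t. `betas` is the (num_steps,) noise schedule."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.tensor([t]), cond)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

@torch.no_grad()
def binauralgrad_sample(mono, stage1, stage2, betas):
    """Two-stage sampling: stage 1 generates the common part conditioned
    on the mono input; stage 2 generates both channels conditioned on the
    mono audio stacked with the stage-1 output."""
    T = mono.shape[-1]
    common = ddpm_reverse(stage1, (1, T), cond=mono, betas=betas)
    cond2 = torch.cat([mono.view(1, -1), common], dim=0)  # (2, T)
    return ddpm_reverse(stage2, (2, T), cond=cond2, betas=betas)
```

How stage 2 is actually conditioned, and how the common part is precisely defined, follows the paper; the channel average and the simple stacking above are just plausible choices for illustration.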
Related papers
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct BEWO-1M, a large-scale, simulation-based, and GPT-assisted dataset with abundant soundscapes and descriptions, including moving and multiple sources.
By leveraging spatial guidance, our unified model achieves the objective of generating immersive and controllable spatial audio from text and image.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, we extend our proposed model to HiddenSinger-U, an unsupervised singing voice learning framework.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- SpecSinGAN: Sound Effect Variation Synthesis Using Single-Image GANs [0.0]
Single-image generative adversarial networks learn from the internal distribution of a single training example to generate variations of it.
SpecSinGAN takes a single one-shot sound effect and produces novel variations of it, as if they were different takes from the same recording session.
arXiv Detail & Related papers (2021-10-14T12:25:52Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios; a rendering sketch follows this list.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
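As background for the HRIR-based pipeline mentioned in the PseudoBinaural entry above, here is a minimal NumPy sketch of conventional binaural rendering: convolving a mono signal with a left/right head-related impulse response pair for one source direction. The impulse responses below are made-up placeholders (real ones come from a measured set such as the CIPIC database), and the paper's spherical-harmonic construction is not reproduced here.

```python
import numpy as np

def render_binaural(mono: np.ndarray, hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Render a (T,) mono signal to (2, T) binaural audio by convolving
    it with the HRIR pair of the desired source direction."""
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])

# Toy usage: a delayed, attenuated right channel mimics a source
# located to the listener's left.
rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)        # 1 s of noise at 16 kHz
hrir_l = np.zeros(64); hrir_l[0] = 1.0   # direct path to the left ear
hrir_r = np.zeros(64); hrir_r[8] = 0.6   # ~0.5 ms later and quieter
binaural = render_binaural(mono, hrir_l, hrir_r)
print(binaural.shape)                    # (2, 16000)
```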