End-to-End Binaural Speech Synthesis
- URL: http://arxiv.org/abs/2207.03697v1
- Date: Fri, 8 Jul 2022 05:18:36 GMT
- Title: End-to-End Binaural Speech Synthesis
- Authors: Wen Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru and Anjali Menon
- Abstract summary: We present an end-to-end speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
- Score: 71.1869877389535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present an end-to-end binaural speech synthesis system that
combines a low-bitrate audio codec with a powerful binaural decoder that is
capable of accurate speech binauralization while faithfully reconstructing
environmental factors like ambient noise or reverb. The network is a modified
vector-quantized variational autoencoder, trained with several carefully
designed objectives, including an adversarial loss. We evaluate the proposed
system on an internal binaural dataset with objective metrics and a perceptual
study. Results show that the proposed approach matches the ground truth data
more closely than previous methods. In particular, we demonstrate the
capability of the adversarial loss in capturing environment effects needed to
create an authentic auditory scene.
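The abstract names the main ingredients (a vector-quantized autoencoder acting as a low-bitrate codec, a binaural decoder, and an adversarial objective) without detailing the model. The sketch below only illustrates that general recipe in PyTorch; the layer sizes, the straight-through quantizer, and the least-squares GAN term are assumptions for illustration, not the authors' implementation.
```python
# Illustrative sketch: a vector-quantized autoencoder whose decoder outputs
# two-channel (binaural) audio, trained with reconstruction + adversarial losses.
# All module shapes and names are assumptions, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                          # z: (batch, dim, frames)
        flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[1]).permute(0, 2, 1)
        q_st = z + (q - z).detach()                # straight-through estimator
        commit = F.mse_loss(z, q.detach()) + F.mse_loss(q, z.detach())
        return q_st, commit

class BinauralVQVAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(              # mono waveform -> low-rate codes
            nn.Conv1d(1, dim, 8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 8, stride=4, padding=2))
        self.quantizer = VectorQuantizer(dim=dim)
        self.decoder = nn.Sequential(              # codes -> two-channel (binaural) audio
            nn.ConvTranspose1d(dim, dim, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(dim, 2, 8, stride=4, padding=2))

    def forward(self, mono):                       # mono: (batch, 1, samples)
        q, commit = self.quantizer(self.encoder(mono))
        return self.decoder(q), commit

discriminator = nn.Sequential(                     # judges real vs. synthesized binaural audio
    nn.Conv1d(2, 32, 15, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 1, 15, stride=4))

def generator_step(model, mono, target_binaural):
    fake, commit = model(mono)
    n = min(fake.shape[-1], target_binaural.shape[-1])
    fake, target = fake[..., :n], target_binaural[..., :n]
    recon = F.l1_loss(fake, target)
    score = discriminator(fake)
    adv = F.mse_loss(score, torch.ones_like(score))  # least-squares GAN generator term
    return recon + commit + adv                      # loss weights and the discriminator
                                                     # training step are omitted
```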
Related papers
- Developing an Effective Training Dataset to Enhance the Performance of AI-based Speaker Separation Systems [0.3277163122167434]
We propose a novel method for constructing a realistic training set that includes mixture signals and corresponding ground truths for each speaker.
We achieve a 1.65 dB improvement in Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) for speaker separation accuracy in realistic mixing conditions.
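SI-SDR, the metric quoted above, has a standard definition; a minimal NumPy version is sketched below (this is not the paper's evaluation code).
```python
# Standard Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) in dB.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """SI-SDR between a separated estimate and the reference signal."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # project the estimate onto the reference; the optimal scaling makes the metric scale-invariant
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))
```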
arXiv Detail & Related papers (2024-11-13T06:55:18Z)
- Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024.
We present an evaluation protocol combining an objective metric, namely the Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation.
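The Fréchet Audio Distance used as the objective metric is the Fréchet distance between Gaussians fitted to audio embeddings (typically from a pretrained model such as VGGish). The sketch below covers only that final distance computation, not the embedding step or the challenge's official scorer.
```python
# Fréchet distance between Gaussians fitted to real and generated audio embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_real: np.ndarray, emb_fake: np.ndarray) -> float:
    """emb_*: (num_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```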
arXiv Detail & Related papers (2024-10-23T06:35:41Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI [20.432212333539628]
We introduce a novel coarse-to-fine audio reconstruction method based on functional Magnetic Resonance Imaging (fMRI) data.
We validate our method on three public fMRI datasets: Brain2Sound, Brain2Music, and Brain2Speech.
By employing semantic prompts during decoding, we enhance the quality of reconstructed audio when semantic features are suboptimal.
arXiv Detail & Related papers (2024-05-29T03:16:14Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z)
- Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks [0.6445605125467573]
This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality.
The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio.
arXiv Detail & Related papers (2021-09-05T15:06:53Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance in subjective preference tests.
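The core binaural cue exploited here is the head-related impulse response; a minimal rendering sketch is shown below, with the HRIR lookup treated as a placeholder and the spherical-harmonic part of the pipeline omitted.
```python
# Minimal HRIR-based binaural rendering: convolve mono audio with a
# direction-dependent pair of head-related impulse responses.
# Obtaining hrir_left/hrir_right for a given source direction is left out;
# this is an illustration, not the paper's PseudoBinaural pipeline.
import numpy as np

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Return a (2, num_samples) binaural signal for one static source direction."""
    left = np.convolve(mono, hrir_left, mode="full")[: len(mono)]
    right = np.convolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right])
```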
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
- Joint Blind Room Acoustic Characterization From Speech And Music Signals Using Convolutional Recurrent Neural Networks [13.12834490248018]
Reverberation time, clarity, and direct-to-reverberant ratio are acoustic parameters that have been defined to describe reverberant environments.
Recent work combining audio signal processing with machine learning suggests that these parameters can be estimated blindly from speech or music signals.
We propose a robust end-to-end method to achieve blind joint acoustic parameter estimation using speech and/or music signals.
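The paper estimates these parameters blindly from signals; for reference, the sketch below computes them the conventional way from a measured room impulse response. The window and decay-range choices (50 ms clarity window, 2.5 ms direct-path window, the -5 to -25 dB fit for RT60) are common conventions and an assumption here, not the paper's exact setup.
```python
# Conventional acoustic parameters from a room impulse response h sampled at fs Hz.
import numpy as np

def acoustic_parameters(h: np.ndarray, fs: int):
    energy = h**2
    # Schroeder backward integration gives the energy decay curve in dB
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    # fit the -5 dB to -25 dB range and extrapolate to -60 dB (T20-style RT60);
    # assumes the response actually decays past -25 dB
    i5 = np.argmax(edc_db <= -5.0)
    i25 = np.argmax(edc_db <= -25.0)
    slope = (edc_db[i25] - edc_db[i5]) / ((i25 - i5) / fs)   # dB per second
    rt60 = -60.0 / slope
    # clarity C50: early (first 50 ms) vs. late energy, in dB
    n50 = int(0.050 * fs)
    c50 = 10.0 * np.log10(energy[:n50].sum() / (energy[n50:].sum() + 1e-12))
    # direct-to-reverberant ratio: direct path taken as ~2.5 ms after the peak
    nd = int(np.argmax(np.abs(h))) + int(0.0025 * fs)
    drr = 10.0 * np.log10(energy[:nd].sum() / (energy[nd:].sum() + 1e-12))
    return rt60, c50, drr
```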
arXiv Detail & Related papers (2020-10-21T17:41:21Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.