Joint Blind Room Acoustic Characterization From Speech And Music Signals
Using Convolutional Recurrent Neural Networks
- URL: http://arxiv.org/abs/2010.11167v1
- Date: Wed, 21 Oct 2020 17:41:21 GMT
- Title: Joint Blind Room Acoustic Characterization From Speech And Music Signals
Using Convolutional Recurrent Neural Networks
- Authors: Paul Callens, Milos Cernak
- Abstract summary: Reverberation time, clarity, and direct-to-reverberant ratio are acoustic parameters that have been defined to describe reverberant environments.
Recent audio and machine learning research suggests that one could estimate those parameters blindly using speech or music signals.
We propose a robust end-to-end method to achieve blind joint acoustic parameter estimation using speech and/or music signals.
- Score: 13.12834490248018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic environment characterization opens doors for sound reproduction
innovations, smart EQing, speech enhancement, hearing aids, and forensics.
Reverberation time, clarity, and direct-to-reverberant ratio are acoustic
parameters that have been defined to describe reverberant environments. They
are closely related to speech intelligibility and sound quality. As explained
in the ISO 3382 standard, they can be derived from a room measurement called the
Room Impulse Response (RIR). However, measuring RIRs requires specific
equipment and an intrusive sound to be played. Recent audio and machine
learning research suggests that one could estimate those parameters blindly
using speech or music signals. We follow these advances and propose a robust
end-to-end method to achieve blind joint acoustic parameter estimation using
speech and/or music signals. Our results indicate that convolutional recurrent
neural networks perform best for this task, and including music in training
also helps improve inference from speech.
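As a concrete illustration of how these parameters follow from an RIR, here is a minimal numpy sketch of ISO 3382-style derivations: T60 from the Schroeder energy decay curve (T20 method), C50 as the early-to-late energy ratio, and DRR from a short window around the direct-path peak. The synthetic RIR, the 16 kHz sampling rate, and the 2.5 ms direct-path window are illustrative assumptions, not values taken from the paper.
```python
# Sketch: deriving T60, C50, and DRR from a measured room impulse response.
import numpy as np

def schroeder_edc_db(h):
    """Schroeder backward-integrated energy decay curve, in dB re. total energy."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # remaining energy from t to the end
    return 10.0 * np.log10(energy / energy[0] + 1e-12)

def rt60_from_rir(h, fs):
    """Estimate T60 by fitting the EDC between -5 dB and -25 dB (T20 method)."""
    edc = schroeder_edc_db(h)
    t = np.arange(len(h)) / fs
    mask = (edc <= -5.0) & (edc >= -25.0)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)    # decay rate in dB per second
    return -60.0 / slope

def clarity_c50(h, fs):
    """C50: early (0-50 ms) to late (>50 ms) energy ratio, in dB."""
    n50 = int(0.050 * fs)
    early, late = np.sum(h[:n50] ** 2), np.sum(h[n50:] ** 2)
    return 10.0 * np.log10(early / (late + 1e-12))

def drr(h, fs, direct_window_ms=2.5):
    """DRR: energy around the direct-path peak vs. the remaining tail, in dB."""
    peak = int(np.argmax(np.abs(h)))
    w = int(direct_window_ms * 1e-3 * fs)
    lo, hi = max(0, peak - w), peak + w
    direct = np.sum(h[lo:hi] ** 2)
    reverberant = np.sum(h[:lo] ** 2) + np.sum(h[hi:] ** 2)
    return 10.0 * np.log10(direct / (reverberant + 1e-12))

# Example on a synthetic exponentially decaying noise "RIR" (illustration only).
fs = 16000
t = np.arange(int(0.5 * fs)) / fs
h = np.random.randn(len(t)) * np.exp(-t / 0.1)
h[0] = 5.0                                          # crude direct-path spike
print(rt60_from_rir(h, fs), clarity_c50(h, fs), drr(h, fs))
```
In the blind setting the paper studies, no RIR is measured; the proposed convolutional recurrent network instead predicts such targets directly from reverberant speech or music signals.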
Related papers
- Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features [5.678610585849838]
Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition.
Unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability.
This paper proposes a modified probing approach to explain deep learning embeddings in the speech emotion space.
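A minimal sketch of the general probing idea behind this: fit a simple regressor from pretrained embeddings to interpretable acoustic features and use the held-out fit quality as an interpretability signal. This is generic probing with placeholder arrays, not the paper's modified approach.
```python
# Sketch: probe pretrained speech embeddings for interpretable acoustic features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))   # placeholder: one embedding per utterance
features = rng.normal(size=(500, 4))       # placeholder: e.g. pitch, energy, jitter, shimmer

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, features, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)                      # linear probe per feature
print("per-feature R^2:", r2_score(y_te, probe.predict(X_te), multioutput="raw_values"))
```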
arXiv Detail & Related papers (2024-09-14T19:18:56Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
view acoustic synthesis aims to render audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification [8.90841350214225]
We introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances.
A contrastive loss function is employed to embed the speech and the acoustic response jointly.
In the test phase, only the reverberant utterance is available, and its embedding is used for the task of room shape classification.
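A minimal sketch of a dual-encoder contrastive objective of this kind, using a generic symmetric InfoNCE loss; the encoders, input sizes, and temperature are placeholders rather than the RevRIR architecture.
```python
# Sketch: embed reverberant speech and room acoustic responses in a shared space.
import torch
import torch.nn.functional as F

speech_encoder = torch.nn.Sequential(torch.nn.Linear(400, 128))  # placeholder encoder
rir_encoder = torch.nn.Sequential(torch.nn.Linear(400, 128))     # placeholder encoder

def contrastive_loss(speech_batch, rir_batch, temperature=0.07):
    s = F.normalize(speech_encoder(speech_batch), dim=-1)
    r = F.normalize(rir_encoder(rir_batch), dim=-1)
    logits = s @ r.t() / temperature                  # pairwise similarities
    targets = torch.arange(len(s))                    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 400), torch.randn(8, 400))
loss.backward()   # at test time only the speech embedding s would be used downstream
```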
arXiv Detail & Related papers (2024-06-05T10:13:55Z) - AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z) - Neural Acoustic Context Field: Rendering Realistic Room Impulse Response
With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene.
Driven by the unique properties of RIR, we design a temporal correlation module and multi-scale energy decay criterion.
Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z) - PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
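A minimal sketch of that general idea: a frozen neural estimator of frame-wise acoustic parameters acts as a differentiable surrogate, and the distance between its predictions on enhanced and clean spectrograms becomes a training signal. The GRU estimator, spectrogram shapes, and L1 distance here are placeholder assumptions, not the PAAPLoss design.
```python
# Sketch: use a frozen parameter estimator as a differentiable enhancement loss.
import torch

param_estimator = torch.nn.GRU(input_size=257, hidden_size=25, batch_first=True)
for p in param_estimator.parameters():
    p.requires_grad_(False)                       # frozen surrogate estimator

def acoustic_param_loss(enhanced_spec, clean_spec):
    est_enh, _ = param_estimator(enhanced_spec)   # (batch, frames, n_params)
    est_cln, _ = param_estimator(clean_spec)
    return torch.nn.functional.l1_loss(est_enh, est_cln)

enhanced = torch.randn(2, 100, 257, requires_grad=True)   # stand-in for model output
clean = torch.randn(2, 100, 257)
loss = acoustic_param_loss(enhanced, clean)                # added to the usual enhancement loss
loss.backward()                                            # gradients flow through the frozen estimator
```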
arXiv Detail & Related papers (2023-02-16T05:17:06Z) - End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)