Evaluation of Deep-Learning-Based Voice Activity Detectors and Room
Impulse Response Models in Reverberant Environments
- URL: http://arxiv.org/abs/2106.13511v1
- Date: Fri, 25 Jun 2021 09:05:38 GMT
- Title: Evaluation of Deep-Learning-Based Voice Activity Detectors and Room
Impulse Response Models in Reverberant Environments
- Authors: Amir Ivry, Israel Cohen, Baruch Berdugo
- Abstract summary: State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data.
We simulate an augmented training set that contains nearly five million utterances.
We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set.
- Score: 13.558688470594676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art deep-learning-based voice activity detectors (VADs) are
often trained with anechoic data. However, real acoustic environments are
generally reverberant, which causes the performance to significantly
deteriorate. To mitigate this mismatch between training data and real data, we
simulate an augmented training set that contains nearly five million
utterances. This extension comprises anechoic utterances and their
reverberant modifications, generated by convolutions of the anechoic utterances
with a variety of room impulse responses (RIRs). We consider five different
models to generate RIRs, and five different VADs that are trained with the
augmented training set. We test all trained systems in three different real
reverberant environments. Experimental results show a $20\%$ increase on average
in accuracy, precision and recall for all detectors and response models,
compared to anechoic training. Furthermore, one of the RIR models consistently
yields better performance than the other models, for all the tested VADs.
Additionally, one of the VADs consistently outperformed the other VADs in all
experiments.
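The augmentation step described in the abstract (convolving anechoic utterances with RIRs) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the exponentially decaying noise RIR is a stand-in assumption and is not claimed to be one of the five RIR models in the paper, and the names `synthetic_rir`, `reverberate`, and `augment` are hypothetical.

```python
import numpy as np


def synthetic_rir(rt60_s, fs, rng):
    """Toy RIR (assumption): white noise under an exponential envelope
    whose amplitude falls by 60 dB after rt60_s seconds."""
    n = int(rt60_s * fs)
    t = np.arange(n) / fs
    # ln(10^3) ~= 6.9078, so the envelope equals 10^-3 (-60 dB) at t = rt60_s.
    envelope = np.exp(-6.9078 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))


def reverberate(anechoic, rir):
    """Convolve an anechoic utterance with an RIR, truncated to the input
    length so that frame-level speech labels stay aligned."""
    wet = np.convolve(anechoic, rir)[: len(anechoic)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


def augment(anechoic, rt60_values, fs, rng):
    """Return the anechoic utterance followed by one reverberant copy
    per requested reverberation time."""
    copies = [reverberate(anechoic, synthetic_rir(rt60, fs, rng))
              for rt60 in rt60_values]
    return [anechoic] + copies
```

For example, `augment(x, [0.3, 0.6], 16000, np.random.default_rng(0))` would return three signals of the same length as `x`: the original plus two reverberant variants. Truncating the convolution tail is a design choice made here for label alignment; keeping the full tail is equally plausible.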
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z) - Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training [39.21885486667879]
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges, including hallucination, outdated knowledge, and untraceable reasoning processes.
Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges.
We propose a novel RAG approach known as Retrieval-augmented Adaptive Adversarial Training (RAAT).
arXiv Detail & Related papers (2024-05-31T16:24:53Z) - AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z) - Self-Supervised Pretraining Improves Performance and Inference
Efficiency in Multiple Lung Ultrasound Interpretation Tasks [65.23740556896654]
We investigated whether self-supervised pretraining could produce a neural network feature extractor applicable to multiple classification tasks in lung ultrasound analysis.
When fine-tuning on three lung ultrasound tasks, pretrained models resulted in an improvement of the average across-task area under the receiver operating curve (AUC) by 0.032 and 0.061 on local and external test sets respectively.
arXiv Detail & Related papers (2023-09-05T21:36:42Z) - A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text
Generation [59.64193903397301]
Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR).
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
arXiv Detail & Related papers (2021-10-11T13:05:06Z) - Scenario Aware Speech Recognition: Advancements for Apollo Fearless
Steps & CHiME-4 Corpora [70.46867541361982]
We consider a general non-semantic speech representation called TRILL, which is trained with a self-supervised criterion based on triplet loss.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
arXiv Detail & Related papers (2021-09-23T00:43:32Z) - Noisy Training Improves E2E ASR for the Edge [22.91184103295888]
Automatic speech recognition (ASR) has become increasingly ubiquitous on modern edge devices.
E2E ASR models are prone to overfitting and have difficulties in generalizing to unseen testing data.
We present a simple yet effective noisy training strategy to further improve the E2E ASR model training.
arXiv Detail & Related papers (2021-07-09T20:56:20Z) - Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)