Convoifilter: A case study of doing cocktail party speech recognition
- URL: http://arxiv.org/abs/2308.11380v3
- Date: Sun, 7 Apr 2024 13:27:08 GMT
- Title: Convoifilter: A case study of doing cocktail party speech recognition
- Authors: Thai-Binh Nguyen, Alexander Waibel
- Abstract summary: The model can decrease ASR's word error rate (WER) from 80% to 26.4%.
We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
- Score: 59.80042864360884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module that isolates the speaker's voice from background noise (ConVoiFilter) with an ASR module. This approach decreases the ASR word error rate (WER) from 80% to 26.4%. Typically, the two components are tuned independently because of their differing data requirements; however, speech enhancement can introduce artifacts that degrade ASR performance. A joint fine-tuning strategy reduces the WER from 26.4% (separate tuning) to 14.5% (joint tuning). We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
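The enhancement-plus-ASR pipeline and the joint fine-tuning strategy described in the abstract can be pictured with a short sketch. The module shapes, the CTC objective, and the 0.5 loss weight below are illustrative assumptions, not the released ConVoiFilter implementation (see hf.co/nguyenvulebinh/voice-filter for the real model):

```python
# Minimal sketch of joint fine-tuning: an enhancement front-end and an
# ASR back-end are optimized together, so the enhancer learns to avoid
# artifacts that hurt recognition. All sizes/weights are assumptions.
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Toy single-channel masking front-end (stand-in for ConVoiFilter)."""
    def __init__(self, n_feats=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU(),
                                 nn.Linear(256, n_feats), nn.Sigmoid())
    def forward(self, noisy_feats):                   # (B, T, F)
        return noisy_feats * self.net(noisy_feats)    # masked features

class ASRModel(nn.Module):
    """Toy CTC recognizer standing in for the ASR module."""
    def __init__(self, n_feats=80, vocab=32):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, vocab)
    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(dim=-1)        # (B, T, V)

enhancer, asr = Enhancer(), ASRModel()
optim = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()), lr=1e-4)
ctc = nn.CTCLoss(blank=0)

noisy = torch.randn(2, 100, 80)                       # batch of noisy features
clean = torch.randn(2, 100, 80)                       # paired clean features
targets = torch.randint(1, 32, (2, 20))
in_lens, tgt_lens = torch.full((2,), 100), torch.full((2,), 20)

enhanced = enhancer(noisy)
logp = asr(enhanced).transpose(0, 1)                  # (T, B, V) for CTCLoss
# Joint objective: ASR loss plus an enhancement term; 0.5 is an assumed weight.
loss = ctc(logp, targets, in_lens, tgt_lens) + 0.5 * nn.functional.l1_loss(enhanced, clean)
loss.backward()
optim.step()
```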
Related papers
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
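A minimal sketch of the two-branch idea behind such disentanglement frameworks: one head pools a time-invariant speaker embedding while the other keeps framewise content, with a decorrelation term nudging the two apart. The architecture, losses, and sizes here are illustrative guesses, not the paper's design:

```python
# Illustrative speaker/content disentanglement; all choices are assumptions.
import torch
import torch.nn as nn

class DisentangleNet(nn.Module):
    def __init__(self, n_feats=80, d=192, n_speakers=1000):
        super().__init__()
        self.shared = nn.GRU(n_feats, 256, batch_first=True)
        self.spk_head = nn.Linear(256, d)       # time-invariant speaker traits
        self.content_head = nn.Linear(256, d)   # time-varying content
        self.spk_clf = nn.Linear(d, n_speakers)
    def forward(self, x):                       # x: (B, T, F)
        h, _ = self.shared(x)
        spk = self.spk_head(h).mean(dim=1)      # pool over time -> speaker embedding
        content = self.content_head(h)          # framewise content embedding
        return spk, content, self.spk_clf(spk)

net = DisentangleNet()
x = torch.randn(4, 200, 80)
spk_ids = torch.randint(0, 1000, (4,))
spk, content, logits = net(x)
loss_spk = nn.functional.cross_entropy(logits, spk_ids)  # speaker branch objective
# One simple decorrelation proxy pushing the branches to differ.
loss_dec = (spk.unsqueeze(1) * content).mean().abs()
(loss_spk + 0.1 * loss_dec).backward()
```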
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of the video input is consistently demonstrated in the mask-based MVDR speech separation and the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends (a standard MVDR formulation is sketched after this entry).
Experiments were conducted on overlapped and reverberant speech data constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
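For reference, the mask-based MVDR front-end mentioned in this entry is commonly instantiated with mask-weighted covariance estimates and the standard Souden-style beamformer below; the paper's exact variant may differ:

```latex
% Mask-weighted spatial covariance estimates for speech (s) and noise (n),
% from the multi-channel STFT y(t, f) and estimated masks m_v(t, f):
\[
\hat{\boldsymbol{\Phi}}_{v}(f) \;=\;
\frac{\sum_{t} m_{v}(t,f)\,\mathbf{y}(t,f)\,\mathbf{y}(t,f)^{\mathsf{H}}}{\sum_{t} m_{v}(t,f)},
\qquad v \in \{s, n\}
\]
% MVDR weights (u selects the reference microphone) and the enhanced output:
\[
\mathbf{w}(f) \;=\;
\frac{\hat{\boldsymbol{\Phi}}_{n}^{-1}(f)\,\hat{\boldsymbol{\Phi}}_{s}(f)}
{\operatorname{tr}\!\big(\hat{\boldsymbol{\Phi}}_{n}^{-1}(f)\,\hat{\boldsymbol{\Phi}}_{s}(f)\big)}\,\mathbf{u},
\qquad
\hat{x}(t,f) \;=\; \mathbf{w}(f)^{\mathsf{H}}\,\mathbf{y}(t,f)
\]
```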
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
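The gradient modulation named in this entry addresses conflicts between the enhancement and separation objectives. The sketch below shows the generic projection trick used by PCGrad-style methods, removing the conflicting component of one task gradient; the paper's actual modulation rule may differ:

```python
# Generic gradient-conflict fix: when two task gradients oppose each
# other, project the auxiliary gradient onto the normal plane of the
# main one. A PCGrad-style sketch, not the paper's exact rule.
import torch

def modulate(g_main: torch.Tensor, g_aux: torch.Tensor) -> torch.Tensor:
    """Return g_aux with its conflicting component (w.r.t. g_main) removed."""
    dot = torch.dot(g_main, g_aux)
    if dot < 0:  # gradients conflict
        g_aux = g_aux - dot / (g_main.norm() ** 2 + 1e-12) * g_main
    return g_aux

# Usage with flattened per-task gradients of the shared parameters:
g_sep = torch.randn(1000)   # separation-loss gradient (assumed main task)
g_enh = torch.randn(1000)   # enhancement-loss gradient (auxiliary)
g_total = g_sep + modulate(g_sep, g_enh)
```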
- Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model.
We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model.
This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
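A sketch of how a "switch" connection like the one above can be wired: the endpointer consumes either raw acoustic frames or low-level latents tapped from the ASR encoder, chosen per forward pass so both paths get trained. All layer choices and shapes are assumptions for illustration:

```python
# Illustrative "switch" connection between an ASR encoder and an
# endpointer (EP); layer choices and shapes are assumptions.
import random
import torch
import torch.nn as nn

class ASREncoder(nn.Module):
    def __init__(self, n_feats=80, d=256):
        super().__init__()
        self.proj = nn.Linear(n_feats, d)           # low-level latent layer
        self.rnn = nn.LSTM(d, d, batch_first=True)
    def forward(self, x):                           # x: (B, T, F)
        latents = torch.relu(self.proj(x))          # what the EP may tap into
        h, _ = self.rnn(latents)
        return h, latents

class Endpointer(nn.Module):
    def __init__(self, n_feats=80, d=256):
        super().__init__()
        self.from_audio = nn.Linear(n_feats, d)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(d, 2))  # speech / end-of-query
    def forward(self, audio, latents, use_latents: bool):
        # The switch: training exercises both paths, so at inference the
        # EP can run cheaply on audio alone for frame filtering, or reuse
        # ASR latents once the ASR model is active.
        feats = latents if use_latents else self.from_audio(audio)
        return self.head(feats)

encoder, ep = ASREncoder(), Endpointer()
audio = torch.randn(2, 100, 80)
asr_out, latents = encoder(audio)
ep_logits = ep(audio, latents, use_latents=random.random() < 0.5)  # (2, 100, 2)
```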
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data-protection avenue for safeguarding user information used to train deep models, by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform a first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
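For context, classic PATE aggregates teacher votes under Laplace noise before handing labels to a student; the sketch below shows that aggregation step. Applying it per frame or per token in ASR, and the noise scale gamma, are assumptions here:

```python
# Illustrative PATE aggregation: teachers trained on disjoint shards vote,
# Laplace noise is added to the vote histogram, and the noisy argmax
# becomes the student's label.
import torch

def noisy_vote(teacher_preds: torch.Tensor, n_classes: int, gamma: float = 0.1) -> torch.Tensor:
    """teacher_preds: (n_teachers,) integer votes for one frame/example."""
    counts = torch.bincount(teacher_preds, minlength=n_classes).float()
    noise = torch.distributions.Laplace(0.0, 1.0 / gamma).sample(counts.shape)
    return (counts + noise).argmax()   # privacy-preserving consensus label

votes = torch.randint(0, 10, (50,))    # 50 teachers voting over 10 classes
student_label = noisy_vote(votes, n_classes=10)
```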
- Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature [7.363994037183394]
We pre-train a speech decomposition network to decompose audio speech into speaker embedding and content embedding.
We then propose a novel auxiliary feature network to estimate the lip auxiliary features from the personalized speech features.
Experimental results show that, compared with the state-of-the-art approach using only the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0%.
arXiv Detail & Related papers (2022-04-02T14:47:19Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
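The convolution-augmented attention referred to above interleaves self-attention with a depthwise convolution module. Below is a compact, illustrative Conformer-style block (half-step feed-forwards, MHSA, conv module); relative positional encoding and other details of the original design are omitted:

```python
# Compact Conformer-style block; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv, BatchNorm, SiLU, pointwise conv."""
    def __init__(self, d=256, kernel=15):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.pw1 = nn.Conv1d(d, 2 * d, 1)
        self.dw = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.bn = nn.BatchNorm1d(d)
        self.pw2 = nn.Conv1d(d, d, 1)
    def forward(self, x):                           # x: (B, T, d)
        y = self.norm(x).transpose(1, 2)            # (B, d, T) for Conv1d
        y = nn.functional.glu(self.pw1(y), dim=1)
        y = self.pw2(nn.functional.silu(self.bn(self.dw(y))))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        ff = lambda: nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                   nn.SiLU(), nn.Linear(4 * d, d))
        self.ff1, self.ff2 = ff(), ff()
        self.norm = nn.LayerNorm(d)
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = ConvModule(d)
        self.final = nn.LayerNorm(d)
    def forward(self, x):                           # x: (B, T, d)
        x = x + 0.5 * self.ff1(x)                   # half-step feed-forward
        a = self.norm(x)
        x = x + self.att(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)                        # convolution module
        x = x + 0.5 * self.ff2(x)
        return self.final(x)

out = ConformerBlock()(torch.randn(2, 100, 256))    # (2, 100, 256)
```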
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
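One plausible reading of the gradual pruning idea above, sketched with PyTorch's built-in pruning utilities: prune a growing fraction of small-magnitude weights and fine-tune the survivors on target-speaker data, leaving the architecture untouched. The schedule and amounts are assumptions:

```python
# Generic gradual magnitude pruning; schedule and amounts are assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 40))

for amount in (0.2, 0.4, 0.6):                        # gradually raise sparsity
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
    # Re-bind the optimizer: pruning reparameterizes `weight` as `weight_orig`.
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    # Fine-tune surviving weights on target-speaker batches (dummy data here).
    x, y = torch.randn(8, 80), torch.randint(0, 40, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    for module in model:
        if isinstance(module, nn.Linear):
            prune.remove(module, "weight")            # make the mask permanent
```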
This list is automatically generated from the titles and abstracts of the papers on this site.