Improved far-field speech recognition using Joint Variational
Autoencoder
- URL: http://arxiv.org/abs/2204.11286v1
- Date: Sun, 24 Apr 2022 14:14:04 GMT
- Title: Improved far-field speech recognition using Joint Variational
Autoencoder
- Authors: Shashi Kumar, Shakti P. Rath and Abhishek Pandey
- Abstract summary: We propose a joint VAE based mapping of speech features from far-field to close-talk, improving over denoising autoencoder (DA) based enhancement.
Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to AM trained directly on far-field filterbank features.
- Score: 5.320201231911981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) systems suffer considerably when source
speech is corrupted with noise or room impulse responses (RIR). Typically,
speech enhancement is applied in both the mismatched and the matched training
and testing scenarios. In the matched setting, the acoustic model (AM) is
trained on dereverberated far-field features, while in the mismatched setting
the AM is kept fixed. In the recent past, mapping speech features from
far-field to close-talk using a denoising autoencoder (DA) has been explored.
In this paper, we focus on
matched scenario training and show that the proposed joint VAE based mapping
achieves a significant improvement over DA. Specifically, we observe an
absolute improvement of 2.5% in word error rate (WER) compared to DA based
enhancement and 3.96% compared to AM trained directly on far-field filterbank
features.
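The abstract does not describe the joint VAE architecture itself, but the underlying idea, mapping far-field filterbank features to parallel close-talk features with a VAE-style regression network, can be sketched as below. Everything in the sketch (layer sizes, the KL weight beta, plain MSE reconstruction toward close-talk frames, the class and function names) is an illustrative assumption, not the authors' exact joint VAE.

```python
# Minimal sketch of VAE-style far-field -> close-talk feature mapping.
# Hypothetical names and hyperparameters; not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMappingVAE(nn.Module):
    """Encode far-field filterbank frames, decode close-talk estimates."""
    def __init__(self, feat_dim=40, hidden=512, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, feat_dim))

    def forward(self, far_field):
        h = self.encoder(far_field)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample the latent during training.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def loss_fn(pred_close_talk, close_talk, mu, logvar, beta=0.1):
    # Reconstruction toward the parallel close-talk features plus a
    # KL regularizer on the latent (beta is an assumed weight).
    recon = F.mse_loss(pred_close_talk, close_talk)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```

In the matched setting described in the abstract, the decoded (enhanced) features would then be used to train the acoustic model.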
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Data Augmentation based Consistency Contrastive Pre-training for
Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved amazing results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training.
arXiv Detail & Related papers (2021-12-23T13:23:17Z)
- Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, a gradual injection of a uniform distribution into the encoder-decoder attention weights during training (a minimal sketch appears after this list).
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative.
arXiv Detail & Related papers (2021-07-02T21:01:17Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
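The relaxed attention entry above describes its mechanism concretely enough to sketch: during training, the encoder-decoder attention weights are blended with a uniform distribution. The helper below is a minimal illustration of that blend; the coefficient gamma, its default value, and falling back to standard softmax at inference are assumptions based on the one-sentence summary, not the paper's exact recipe.

```python
# Minimal sketch of relaxed attention (assumed gamma and placement).
import torch
import torch.nn.functional as F

def relaxed_attention(scores, gamma=0.1, training=True):
    """Blend softmax attention weights with a uniform distribution.

    scores: (..., num_keys) raw encoder-decoder attention logits.
    gamma:  relaxation coefficient (illustrative value).
    """
    weights = F.softmax(scores, dim=-1)
    if training and gamma > 0.0:
        uniform = torch.full_like(weights, 1.0 / weights.size(-1))
        weights = (1.0 - gamma) * weights + gamma * uniform
    return weights
```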
This list is automatically generated from the titles and abstracts of the papers on this site.