Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription
- URL: http://arxiv.org/abs/2004.10799v3
- Date: Fri, 7 Aug 2020 19:36:44 GMT
- Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While end-to-end ASR systems have proven competitive with the conventional
hybrid approach, they are prone to accuracy degradation in noisy and
low-resource conditions. In this paper, we argue that, even in such
difficult cases, some end-to-end approaches show performance close to the
hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an
example of challenging environments and noisy conditions of everyday speech. We
experimentally compare and analyze CTC-Attention versus RNN-Transducer
approaches along with RNN versus Transformer architectures. We also provide a
comparison of acoustic features and speech enhancement methods. In addition,
we evaluate the effectiveness of neural network language models for hypothesis
re-scoring in low-resource conditions. Our best end-to-end model, based on
RNN-Transducer with an improved beam search, achieves a WER only 3.8% abs.
worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline. With Guided Source
Separation based training data augmentation, this approach outperforms the
hybrid baseline system by 2.7% WER abs. and the previously best known
end-to-end system by 25.7% WER abs.
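The n-best hypothesis re-scoring with a neural language model mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the interpolation weight `lm_weight`, the toy LM, and all scores are assumptions chosen for the example.

```python
# Hypothetical sketch of n-best hypothesis re-scoring with an external
# language model: each beam-search hypothesis keeps its ASR log-score,
# an LM log-probability is added with an interpolation weight, and the
# hypothesis with the best combined score is selected.

def rescore_nbest(nbest, lm_score_fn, lm_weight=0.3):
    """Re-rank (hypothesis, asr_score) pairs by interpolating an LM score.

    nbest       : list of (text, asr_log_prob) tuples from beam search
    lm_score_fn : callable mapping text -> LM log-probability
    lm_weight   : interpolation weight for the LM score
    """
    rescored = [
        (text, asr_score + lm_weight * lm_score_fn(text))
        for text, asr_score in nbest
    ]
    # The highest combined log-probability wins.
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy stand-in for a neural LM: it assigns a better log-probability to
# the more fluent hypothesis.
toy_lm = {"the dinner party": -2.0, "the dinner par tea": -9.0}
best = rescore_nbest(
    [("the dinner par tea", -4.0), ("the dinner party", -4.5)],
    lm_score_fn=lambda t: toy_lm[t],
)
```

Here the LM term overturns the acoustic ranking: the second hypothesis wins despite its slightly worse ASR score, which is the effect re-scoring is meant to have in low-resource conditions where the acoustic model alone is unreliable.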
Related papers
- Hybrid Deep Convolutional Neural Networks Combined with Autoencoders And Augmented Data To Predict The Look-Up Table 2006 [2.082445711353476]
This study explores the development of a hybrid deep convolutional neural network (DCNN) model enhanced by autoencoders and data augmentation techniques.
By augmenting the original input features using three different autoencoder configurations, the model's predictive capabilities were significantly improved.
arXiv Detail & Related papers (2024-08-26T20:45:07Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - From Environmental Sound Representation to Robustness of 2D CNN Models
Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z) - Mitigating Closed-model Adversarial Examples with Bayesian Neural
Modeling for Enhanced End-to-End Speech Recognition [18.83748866242237]
We focus on a rigorous and empirical "closed-model adversarial robustness" setting.
We propose an advanced Bayesian neural network (BNN) based adversarial detector.
We improve the detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on the LibriSpeech dataset.
arXiv Detail & Related papers (2022-02-17T09:17:58Z) - Novel Hybrid DNN Approaches for Speaker Verification in Emotional and
Stressful Talking Environments [1.0998375857698495]
This work combined deep models with shallow architectures, resulting in novel hybrid classifiers.
Four distinct hybrid models were utilized, including deep neural network-hidden Markov model (DNN-HMM), deep neural network-Gaussian mixture model (DNN-GMM), and hidden Markov model-deep neural network (HMM-DNN).
Results showed that HMM-DNN outperformed all other hybrid models in terms of equal error rate (EER) and area under the curve (AUC) evaluation metrics.
arXiv Detail & Related papers (2021-12-26T10:47:14Z) - Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word-error-rate as well as to increase training speed.
We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z) - Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which a recurrent neural network transducer (RNN-T) can achieve better ASR accuracy by performing auxiliary tasks.
arXiv Detail & Related papers (2020-11-05T21:46:32Z) - From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z) - RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and
Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.