Feature Replacement and Combination for Hybrid ASR Systems
- URL: http://arxiv.org/abs/2104.04298v1
- Date: Fri, 9 Apr 2021 11:04:58 GMT
- Title: Feature Replacement and Combination for Hybrid ASR Systems
- Authors: Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter,
Hermann Ney
- Abstract summary: We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
- Score: 47.74348197215634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic modeling of raw waveform and learning feature extractors as part of
the neural network classifier has been the goal of many studies in the area of
automatic speech recognition (ASR). Recently, one line of research has focused
on frameworks that can be pre-trained on audio-only data in an unsupervised
fashion and aim at improving downstream ASR tasks. In this work, we investigate
the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid
ASR systems. In addition to deploying a pre-trained feature extractor, we
explore how to make use of an existing acoustic model (AM) trained on the same
task with different features. For comparison, a further neural front-end, trained
only jointly with the supervised ASR loss, and traditional Gammatone features are
applied. Moreover, it is shown that the AM can be
retrofitted with i-vectors for speaker adaptation. Finally, the described
features are combined in order to further advance the performance. With the
final best system, we obtain a relative improvement of 4% and 6% over our
previous best model on the LibriSpeech test-clean and test-other sets.
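The feature-combination step described in the abstract can be sketched as simple frame-wise concatenation of two front-end outputs. The sketch below is a minimal illustration under assumed shapes (the dimensions and the truncation-based alignment are hypothetical, not taken from the paper):

```python
import numpy as np

def combine_features(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Combine two feature streams by frame-wise concatenation.

    feat_a, feat_b: arrays of shape (num_frames, dim). Different front-ends
    (e.g. a pre-trained extractor vs. Gammatone filters) may yield slightly
    different frame counts, so both streams are truncated to the shorter one
    before concatenating along the feature axis.
    """
    n_frames = min(feat_a.shape[0], feat_b.shape[0])
    return np.concatenate([feat_a[:n_frames], feat_b[:n_frames]], axis=1)

# Hypothetical example: 100 frames of 512-dim learned features combined
# with 102 frames of 50-dim Gammatone-style features.
learned_feats = np.random.randn(100, 512)
gammatone_feats = np.random.randn(102, 50)
combined = combine_features(learned_feats, gammatone_feats)
print(combined.shape)  # (100, 562)
```

In a real hybrid system the combined frames would then be fed to the acoustic model in place of a single feature stream.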
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts w.r.t. integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach for joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of the Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization [21.216783537997426]
We propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
arXiv Detail & Related papers (2021-08-10T13:49:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.