End-to-End Integration of Speech Recognition, Speech Enhancement, and
Self-Supervised Learning Representation
- URL: http://arxiv.org/abs/2204.00540v1
- Date: Fri, 1 Apr 2022 16:02:31 GMT
- Title: End-to-End Integration of Speech Recognition, Speech Enhancement, and
Self-Supervised Learning Representation
- Authors: Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe
- Abstract summary: This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition.
Compared with conventional E2E ASR models, the proposed E2E model integrates two important modules.
The IRIS model achieves the best performance reported in the literature for the single-channel CHiME-4 benchmark.
- Score: 36.66970917185465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents our end-to-end (E2E) automatic speech recognition (ASR)
model targeting robust speech recognition, called Integrated speech Recognition
with enhanced speech Input for Self-supervised learning representation (IRIS).
Compared with conventional E2E ASR models, the proposed E2E model integrates two
important modules: a speech enhancement (SE) module and a self-supervised
learning representation (SSLR) module. The SE module enhances the noisy speech,
and the SSLR module then extracts features from the enhanced speech for
recognition. To train the proposed model, we establish an efficient learning
scheme. Evaluation results on the monaural CHiME-4 task show that the IRIS model
achieves the best performance reported in the literature for the single-channel
CHiME-4 benchmark (2.0% word error rate on the real development set and 3.9% on
the real test set), thanks to the powerful pre-trained SSLR module and the
fine-tuned SE module.
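A minimal sketch of the processing order described above (noisy waveform -> SE module -> SSLR feature extraction -> ASR back-end), written in plain PyTorch. The module internals below (SpeechEnhancer, SSLRFrontend, ASRBackend) are simplified placeholders introduced only for illustration, not the authors' implementation; in the paper the SSLR module is a large pre-trained self-supervised model and the SE and ASR components are full neural front- and back-ends.
```python
# Sketch of the IRIS pipeline: enhance -> SSL features -> recognition.
# All modules are toy stand-ins; only the composition order follows the paper.
import torch
import torch.nn as nn


class SpeechEnhancer(nn.Module):
    """Placeholder SE module: maps a noisy waveform to an enhanced waveform."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wav):                 # (batch, samples)
        return self.net(noisy_wav.unsqueeze(1)).squeeze(1)


class SSLRFrontend(nn.Module):
    """Placeholder SSLR module: frame the waveform and project to features.
    In the paper this is a pre-trained self-supervised representation model."""
    def __init__(self, frame=320, dim=256):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, dim)

    def forward(self, wav):                       # (batch, samples)
        b, n = wav.shape
        n = n - n % self.frame
        frames = wav[:, :n].reshape(b, -1, self.frame)
        return self.proj(frames)                  # (batch, frames, dim)


class ASRBackend(nn.Module):
    """Placeholder ASR back-end: encoder plus token projection."""
    def __init__(self, dim=256, vocab=500):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.out(enc)                      # (batch, frames, vocab) logits


class IRIS(nn.Module):
    """End-to-end composition: enhance, extract SSL features, recognize."""
    def __init__(self):
        super().__init__()
        self.se = SpeechEnhancer()
        self.sslr = SSLRFrontend()
        self.asr = ASRBackend()

    def forward(self, noisy_wav):
        enhanced = self.se(noisy_wav)             # SE module denoises the input
        feats = self.sslr(enhanced)               # SSLR features from enhanced speech
        return self.asr(feats)                    # logits for decoding / ASR loss


if __name__ == "__main__":
    logits = IRIS()(torch.randn(2, 16000))        # two 1-second, 16 kHz utterances
    print(logits.shape)                           # torch.Size([2, 50, 500])
```
Because all three modules sit in one computation graph, the whole stack can be optimized end to end, which is the integration the paper's learning scheme exploits.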
Related papers
- Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR.
The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio [45.04646762560459]
Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings.
Considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulated data.
We present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings.
arXiv Detail & Related papers (2021-07-06T19:36:48Z)
- Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD [10.168591454648123]
This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
arXiv Detail & Related papers (2021-03-02T11:49:03Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings [40.99930744000231]
We extend the prior work by addressing the case where no speaker profile is available.
We perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model.
We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well.
arXiv Detail & Related papers (2020-08-11T06:41:55Z)