Long-Running Speech Recognizer: An End-to-End Multi-Task Learning
Framework for Online ASR and VAD
- URL: http://arxiv.org/abs/2103.01661v1
- Date: Tue, 2 Mar 2021 11:49:03 GMT
- Title: Long-Running Speech Recognizer: An End-to-End Multi-Task Learning
Framework for Online ASR and VAD
- Authors: Meng Li, Shiyu Zhou, Bo Xu
- Abstract summary: This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
- Score: 10.168591454648123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When an end-to-end automatic speech recognition (E2E-ASR) system is
deployed in real-world applications, a voice activity detection (VAD) system is
usually needed to improve performance and to reduce computational cost by
discarding non-speech parts of the audio. This paper presents a novel
end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and
VAD into one model. The proposed system, which we refer to as Long-Running
Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate
task-specific datasets in the training stage. With the assistance of VAD, the
ASR performance improves as its connectionist temporal classification (CTC)
loss function can leverage the VAD alignment information. In the inference
stage, the LR-SR system removes non-speech parts at low computational cost and
recognizes speech parts with high robustness. Experimental results on segmented
speech data show that the proposed MTL framework outperforms the baseline
single-task learning (STL) framework on the ASR task. On unsegmented speech data,
we find that the LR-SR system outperforms baseline ASR systems that rely on
an additional GMM-based or DNN-based voice activity detector.
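The inference-stage idea in the abstract (drop non-speech parts cheaply, then recognize only the speech parts) can be illustrated with a minimal sketch. This is not the LR-SR implementation: in the paper VAD and ASR live in one model, while here they are separated for clarity. The function names, the `threshold` and `min_frames` parameters, and the `recognize` callback are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def speech_segments(
    speech_probs: Sequence[float],
    threshold: float = 0.5,   # assumed decision threshold, not from the paper
    min_frames: int = 2,      # assumed minimum run length, not from the paper
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose VAD probability is >= threshold.

    Runs shorter than `min_frames` are dropped as spurious detections.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            if start is None:
                start = i  # a speech run begins here
        elif start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None  # the run ended at frame i
    # Close a run that extends to the last frame.
    if start is not None and len(speech_probs) - start >= min_frames:
        segments.append((start, len(speech_probs)))
    return segments


def recognize_long_audio(
    frames: Sequence,
    speech_probs: Sequence[float],
    recognize: Callable[[Sequence], object],
) -> list:
    """Run the (hypothetical) `recognize` callback only on frames kept by the VAD gate."""
    return [recognize(frames[s:e]) for s, e in speech_segments(speech_probs)]
```

With per-frame probabilities such as `[0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.9, 0.95]`, the one-frame blip at index 4 is discarded and only the two longer runs reach the recognizer, which is where the computational saving would come from.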
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
speculative speech recognition (SSR)
We propose a model which does SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- A Deep Learning System for Domain-specific Speech Recognition [0.0]
The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems.
The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM.
The viability of using error-prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated.
arXiv Detail & Related papers (2023-03-18T22:19:09Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)