LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
- URL: http://arxiv.org/abs/2406.04432v1
- Date: Thu, 6 Jun 2024 18:17:59 GMT
- Title: LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
- Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
- Abstract summary: LipGER is a framework for leveraging visual cues for noise-robust ASR.
We show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%.
We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
- Score: 46.438575751932866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs that is additionally equipped with lip motion cues to promote further research in this space.
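The abstract reports its results as Word Error Rate (WER) over N-best hypotheses. As a rough illustration of that metric only, the sketch below computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions made for illustration; this is not the paper's evaluation code, and it does no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical N-best list for a single utterance (made-up sentences).
reference = "turn the lights off"
n_best = ["turn the light off", "turn the lights of", "turn the lights off"]
scores = [word_error_rate(reference, hyp) for hyp in n_best]
```

In this toy list the first two hypotheses each contain one substituted word out of four, and the third matches the reference exactly.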
Related papers
- LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition [12.336693356113308]
We propose a novel framework, LipGen, to improve model robustness.
We introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms.
Our method demonstrates superior performance compared to the current state-of-the-art on the Lip Reading in the Wild (LRW) dataset.
arXiv Detail & Related papers (2025-01-08T00:52:19Z)
- Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition [39.206005299985605]
We propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again".
The proposed AVGER can reduce Word Error Rate (WER) by 24% compared to current mainstream AVSR systems.
arXiv Detail & Related papers (2025-01-03T10:51:14Z)
- LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation [15.520180125182756]
Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy.
Existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents.
We propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR.
arXiv Detail & Related papers (2024-09-13T07:28:47Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a WER of 26.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.