Related papers: LA-RAG:Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation

URL: http://arxiv.org/abs/2409.08597v1
Date: Fri, 13 Sep 2024 07:28:47 GMT
Title: LA-RAG:Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation
Authors: Shaojun Li, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Xianghui He, Min Zhang, Hao Yang
Abstract summary: Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy. Existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. We propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR.
Score: 15.520180125182756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy. However, existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. LA-RAG leverages fine-grained token-level speech datastores and a speech-to-speech retrieval mechanism to enhance ASR accuracy via the in-context learning (ICL) capabilities of LLMs. Experiments on Mandarin and various Chinese dialect datasets demonstrate significant improvements in ASR accuracy compared to existing methods, validating the effectiveness of our approach, especially in handling accent variations.

Related papers

Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning [13.113505050543298]
We introduce a large language model capable of processing speech inputs. We show that tuning it further with reinforcement learning on human preference enables it to adapt better to disordered speech than traditional fine-tuning.
arXiv Detail & Related papers (2024-12-25T00:16:22Z)
A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario [9.290091297389033]
Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks. Their potential for addressing speech recognition challenges in low resource settings remains underexplored.
arXiv Detail & Related papers (2024-12-01T08:07:01Z)
Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs [20.97172337899685]
We propose pre-training large language models (LLMs) on Pinyin embedding sequences to generate corresponding Chinese characters. This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data. On the AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline.
arXiv Detail & Related papers (2024-09-24T12:06:31Z)
Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs). To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR). In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER. Experiments on various recent LLMs demonstrate that our approach achieves up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing. Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models for better accuracy. We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems. Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.