SALSA: Speedy ASR-LLM Synchronous Aggregation
- URL: http://arxiv.org/abs/2408.16542v1
- Date: Thu, 29 Aug 2024 14:00:57 GMT
- Title: SALSA: Speedy ASR-LLM Synchronous Aggregation
- Authors: Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi
- Abstract summary: We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders.
We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
- Score: 40.91241351045586
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
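Below is a minimal, illustrative sketch of the kind of coupling the abstract describes: the last ASR decoder state is projected through a single linear layer into the LLM's hidden space before the LLM scores the next token, so both decoders can advance in lockstep. Module names, toy dimensions, and the additive fusion are assumptions for illustration rather than the paper's exact architecture, and the cascading-tokenization step is omitted.

```python
import torch
import torch.nn as nn

class CoupledStep(nn.Module):
    """One synchronous decoding step: project the last ASR decoder state into
    the LLM's hidden space and score the next LLM token from the fused state."""

    def __init__(self, asr_dim: int, llm_dim: int, llm_vocab: int):
        super().__init__()
        # The only newly trained parameters: a simple linear projection.
        self.proj = nn.Linear(asr_dim, llm_dim)
        # Stand-in for the (frozen) LLM output head.
        self.llm_head = nn.Linear(llm_dim, llm_vocab)

    def forward(self, asr_last_state: torch.Tensor, llm_hidden: torch.Tensor) -> torch.Tensor:
        fused = llm_hidden + self.proj(asr_last_state)  # additive fusion (illustrative)
        return torch.log_softmax(self.llm_head(fused), dim=-1)

step = CoupledStep(asr_dim=512, llm_dim=1024, llm_vocab=32000)
next_token_logp = step(torch.randn(1, 512), torch.randn(1, 1024))  # shape [1, 32000]
```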
Related papers
- Better Instruction-Following Through Minimum Bayes Risk [48.879360919760074]
General-purpose LLM judges capable of human-level evaluation provide a scalable and accurate way of evaluating instruction-following LLMs.
One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding.
MBR decoding uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs.
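A generic sketch of MBR selection as summarized above: pick the candidate with the highest expected utility against the other candidates. The simple token-overlap utility here is a stand-in for the LLM judge that the paper uses as the reference-based evaluator.

```python
def token_overlap(hyp: str, ref: str) -> float:
    # Toy reference-based utility: Jaccard overlap of word sets.
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def mbr_select(candidates: list[str], utility=token_overlap) -> str:
    # Score each candidate by its average utility against all other candidates.
    def expected_utility(i: int) -> float:
        others = [r for j, r in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], r) for r in others) / max(len(others), 1)
    return candidates[max(range(len(candidates)), key=expected_utility)]

best = mbr_select(["the cat sat", "a cat sat down", "the cat sat down"])
```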
arXiv Detail & Related papers (2024-10-03T18:48:38Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction [34.32834323898953]
Generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors.
This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances.
We also introduce a new multi-pass augmented generative error correction (MPA GER) by integrating multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merging them.
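A rough sketch of that multi-pass flow, under stated assumptions: `correct_with_llm` and the majority-vote merge are placeholders for the paper's LLM prompting and merging steps, not actual APIs.

```python
from collections import Counter

def correct_with_llm(llm_name: str, hypotheses: list[str]) -> str:
    # Placeholder: a real system would prompt the named LLM with the full
    # set of ASR hypotheses and return its corrected transcript.
    return hypotheses[0]

def mpa_ger(asr_hypotheses: list[str], llms: list[str]) -> str:
    # Input side: every LLM sees all system hypotheses.
    corrections = [correct_with_llm(name, asr_hypotheses) for name in llms]
    # Output side: merge per-LLM corrections (majority vote as a stand-in).
    return Counter(corrections).most_common(1)[0][0]

final = mpa_ger(["kyou wa ii tenki", "kyo wa ii tenki"], ["llm-a", "llm-b", "llm-c"])
```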
arXiv Detail & Related papers (2024-08-29T00:18:12Z)
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition [21.516152600963775]
Denoising LM (DLM) is a scaled error correction model trained with vast amounts of synthetic data.
DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech.
arXiv Detail & Related papers (2024-05-24T05:05:12Z)
- An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We find that such delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
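A minimal sketch of that composition under assumed toy dimensions: a frozen speech encoder and a frozen LLM (neither shown) joined by a single trainable linear projector that maps encoder features into the LLM embedding space.

```python
import torch
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    """The only trainable module in the sketched pipeline: a linear map from
    speech-encoder features to the LLM's embedding dimension."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: [batch, frames, encoder_dim] from a frozen encoder.
        # The output would be prepended to the text embeddings fed to the frozen LLM.
        return self.proj(speech_features)

proj = SpeechToLLMProjector()
llm_inputs = proj(torch.randn(1, 100, 1280))  # shape [1, 100, 4096]
```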
arXiv Detail & Related papers (2024-02-13T23:25:04Z)
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
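A condensed, illustrative sketch of the mutual-learning step only (the large-margin contrastive term is omitted): two SLU models, one fed manual transcripts and one fed ASR transcripts, each regularized toward the other's detached predictions. The shapes and the weighting coefficient are assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_manual, logits_asr, labels, alpha: float = 0.5):
    # Supervised cross-entropy for both models on the gold SLU labels.
    ce = F.cross_entropy(logits_manual, labels) + F.cross_entropy(logits_asr, labels)
    # Mutual learning: each model mimics the other's (detached) output distribution.
    kl = (
        F.kl_div(F.log_softmax(logits_asr, dim=-1),
                 F.softmax(logits_manual.detach(), dim=-1), reduction="batchmean")
        + F.kl_div(F.log_softmax(logits_manual, dim=-1),
                   F.softmax(logits_asr.detach(), dim=-1), reduction="batchmean")
    )
    return ce + alpha * kl

loss = mutual_learning_loss(torch.randn(8, 10), torch.randn(8, 10),
                            torch.randint(0, 10, (8,)))
```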
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- Can Generative Large Language Models Perform ASR Error Correction? [16.246481696611117]
Generative large language models (LLMs) have been applied to a wide range of natural language processing tasks.
In this paper we investigate using ChatGPT, a generative LLM, for ASR error correction.
Experiments show that this generative LLM approach can yield performance gains for two different state-of-the-art ASR architectures.
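A schematic of this kind of LLM-based error correction: build a prompt from the N-best ASR hypotheses and ask a generative LLM for the most plausible transcript. `query_llm` is a placeholder rather than a real API call, and the prompt wording is illustrative.

```python
def build_correction_prompt(nbest: list[str]) -> str:
    # Present the recognizer's N-best list and ask for a single corrected transcript.
    listed = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return ("These are N-best hypotheses from a speech recognizer:\n"
            f"{listed}\n"
            "Return the single most likely correct transcript.")

def query_llm(prompt: str) -> str:
    # Placeholder for a call to ChatGPT or another generative LLM.
    raise NotImplementedError

prompt = build_correction_prompt(["i scream for ice cream",
                                  "eye scream for ice cream"])
```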
arXiv Detail & Related papers (2023-07-09T13:38:25Z)
- Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z)