It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2402.05457v1
- Date: Thu, 8 Feb 2024 07:21:45 GMT
- Title: It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition
- Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu
Chen, Eng Siong Chng, Chao-Han Huck Yang
- Abstract summary: Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
- Score: 70.77292069313154
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent studies have shown that large language models (LLMs) can
be successfully used for generative error correction (GER) on top of the
automatic speech recognition (ASR) output. Specifically, an LLM is utilized to
carry out a direct mapping from the N-best hypotheses list generated by an ASR
system to the predicted output transcription. However, despite its
effectiveness, GER introduces extra data uncertainty since the LLM is trained
without taking into account acoustic information available in the speech
signal. In this work, we aim to overcome such a limitation by infusing acoustic
information before generating the predicted transcription through a novel late
fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a
multimodal fusion approach implemented into an auto-regressive decoding process
and works in two stages: (i) It first analyzes and calibrates the token-level
LLM decision, and (ii) it then dynamically assimilates the information from the
acoustic modality. Experimental evidence collected from various ASR tasks shows
that UADF surpasses existing fusion mechanisms in several ways. It yields
significant improvements in word error rate (WER) while mitigating data
uncertainty issues in the LLM and addressing the poor generalization caused by
relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to
audio-visual speech recognition.
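As a rough illustration of the two-stage idea described in the abstract, one decoding step can be sketched as follows. This is not the authors' implementation: the function names, the temperature-scaling calibration, and the use of normalized entropy as the uncertainty measure driving the fusion weight are all assumptions made for the sketch.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 softens (calibrates) the distribution."""
    z = logits / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def uadf_step(llm_logits, asr_logits, tau=1.5):
    """Hypothetical sketch of one UADF decoding step:
    (i) calibrate the token-level LLM distribution with temperature tau,
    (ii) dynamically weight the acoustic distribution by the LLM's
    uncertainty, measured here as normalized entropy (an assumption)."""
    p_llm = softmax(llm_logits, tau)
    p_asr = softmax(asr_logits)
    vocab_size = len(p_llm)
    entropy = -np.sum(p_llm * np.log(p_llm + 1e-12))
    lam = entropy / np.log(vocab_size)  # 0 = confident LLM, 1 = maximally uncertain
    p_fused = (1.0 - lam) * p_llm + lam * p_asr
    return int(np.argmax(p_fused)), p_fused
```

Under this sketch, a confident LLM dominates the fused decision, while an uncertain LLM defers to the acoustic model, which is the qualitative behavior the abstract describes.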
Related papers
- Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models [16.920823078873095]
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword.
We show on the real-world dataset of follow-up conversations that this approach yields large gains due to the joint modeling of the previous speech context and ASR uncertainty.
arXiv Detail & Related papers (2024-10-28T19:43:43Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition [10.62060432965311]
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR).
Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts.
arXiv Detail & Related papers (2023-10-10T09:04:33Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
- Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate an LM as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior.
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
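The regularization idea in the entry above can be sketched as a per-token loss: the usual cross-entropy plus a penalty that pushes the TM's output distribution toward the LM prior. This is a hypothetical illustration with assumed names, using a KL-divergence penalty as the regularizer; the paper's exact objective may differ.

```python
import numpy as np

def lm_prior_loss(tm_probs, lm_probs, target_idx, beta=0.5):
    """Hypothetical sketch: per-token cross-entropy on the target, plus a
    KL(TM || LM) regularizer (weighted by beta) that makes the TM's output
    distribution more probable under the LM prior."""
    ce = -np.log(tm_probs[target_idx] + 1e-12)
    kl = np.sum(tm_probs * (np.log(tm_probs + 1e-12) - np.log(lm_probs + 1e-12)))
    return ce + beta * kl
```

Because KL divergence is zero only when the two distributions match, the penalty vanishes when the TM already agrees with the LM prior and grows as the TM drifts away from it.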
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.