HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models
- URL: http://arxiv.org/abs/2309.15701v2
- Date: Mon, 16 Oct 2023 05:47:42 GMT
- Title: HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models
- Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Macro Siniscalchi,
Pin-Yu Chen, Eng Siong Chng
- Abstract summary: We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
- Score: 81.56455625624041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in deep neural networks have allowed automatic speech
recognition (ASR) systems to attain human parity on several publicly available
clean speech datasets. However, even state-of-the-art ASR systems experience
performance degradation when confronted with adverse conditions, as a
well-trained acoustic model is sensitive to variations in the speech domain,
e.g., background noise. Intuitively, humans address this issue by relying on
their linguistic knowledge: the meaning of ambiguous spoken terms is usually
inferred from contextual cues, thereby reducing the dependency on the auditory
system. Inspired by this observation, we introduce the first open-source
benchmark to utilize external large language models (LLMs) for ASR error
correction, where N-best decoding hypotheses provide informative elements for
true transcription prediction. This approach is a paradigm shift from the
traditional language model rescoring strategy that can only select one
candidate hypothesis as the output transcription. The proposed benchmark
contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs
of N-best hypotheses and corresponding accurate transcriptions across prevalent
speech domains. Given this dataset, we examine three types of error correction
techniques based on LLMs with varying amounts of labeled
hypotheses-transcription pairs, which yield a significant word error rate (WER)
reduction. Experimental evidence demonstrates the proposed technique achieves a
breakthrough by surpassing the upper bound of traditional re-ranking based
methods. More surprisingly, an LLM with a reasonable prompt can use its generative
capability to correct even those tokens that are missing from the N-best list. We
make our results publicly accessible for reproducible pipelines with released
pre-trained models, thus providing a new evaluation paradigm for ASR error
correction with LLMs.
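The contrast the abstract draws between re-ranking and generative correction can be illustrated with a small sketch. It is not the authors' evaluation code: the function names, the toy reference sentence, and the three hypotheses are all invented for illustration, and the "generated" string simply stands in for what an LLM might output. The sketch computes word-level WER, then the best WER any re-ranker could reach by selecting a single hypothesis (the re-ranking upper bound), and compares it with a transcription assembled from tokens spread across several hypotheses.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def oracle_rerank_wer(reference, nbest):
    """Lowest WER reachable by picking exactly one hypothesis from the list."""
    return min(wer(reference, h) for h in nbest)


# Hypothetical toy data: every hypothesis contains one different error.
reference = "the cat sat on the mat"
nbest = [
    "the cat sat on a mat",
    "the cat sat in the mat",
    "a cat sat on the mat",
]
# Stand-in for a generative correction that combines evidence across the list.
generated = "the cat sat on the mat"

print(oracle_rerank_wer(reference, nbest))  # best case for selection only
print(wer(reference, generated))            # generative output is not limited to one hypothesis
```

In this toy case every hypothesis has one word wrong, so selection alone cannot go below one error in six words, while an output that draws on all three hypotheses can match the reference. The further claim in the abstract, that tokens absent from every hypothesis can also be recovered, is not modeled here.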
Related papers
- Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters.
arXiv Detail & Related papers (2024-10-24T01:58:11Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition [10.62060432965311]
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR).
Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts.
arXiv Detail & Related papers (2023-10-10T09:04:33Z)
- Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition [1.2914521751805657]
We propose a novel approach to collectively exploit the contextualized representation and the phonetic information between the error and its replacing candidates to alleviate the error rate of Chinese ASR.
Our experimental results on real-world speech recognition showed that our proposed method has an evidently lower error rate than the baseline model.
arXiv Detail & Related papers (2021-11-16T11:55:27Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.