Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
- URL: http://arxiv.org/abs/2505.22251v2
- Date: Thu, 05 Jun 2025 10:40:05 GMT
- Title: Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition
- Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya,
- Abstract summary: This work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora.<n>To measure contamination impact, LLMs trained with/without contamination are compared.<n>Results show that LLM outputs can be biased by tiny amounts of data contamination.
- Score: 16.1813157127555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure contamination impact, LLMs trained with/without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
Related papers
- An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform Many-to-many summarization (M2MS) in real applications.<n>This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z) - Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context [12.781022584125925]
We construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning.
Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context.
We make our code and dataset publicly available.
arXiv Detail & Related papers (2024-10-21T14:47:37Z) - Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment [53.17596274334017]
We evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs.<n>Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
arXiv Detail & Related papers (2024-10-06T08:33:39Z) - Are LLMs Good Annotators for Discourse-level Event Relation Extraction? [15.365993658296016]
We assess the effectiveness of Large Language Models (LLMs) in addressing discourse-level event relation extraction tasks.<n> Evaluation is conducted using an commercial model, GPT-3.5, and an open-source model, LLaMA-2.
arXiv Detail & Related papers (2024-07-28T19:27:06Z) - Data Contamination Can Cross Language Barriers [29.103517721155487]
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data.
We first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods.
We propose generalization-based approaches to unmask such deeply concealed contamination.
arXiv Detail & Related papers (2024-06-19T05:53:27Z) - Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z) - The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLM)
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z) - Purifying Large Language Models by Ensembling a Small Language Model [39.57304668057076]
We propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data.
We empirically confirm the efficacy of ensembling LLMs with benign and small language models (SLMs)
arXiv Detail & Related papers (2024-02-19T14:00:39Z) - Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder
arXiv Detail & Related papers (2023-12-30T14:20:04Z) - Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM)
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z) - Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide Large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z) - Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning"
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT)
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.