Related papers: Evaluating Hallucinations in Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions

URL: http://arxiv.org/abs/2510.08581v1
Date: Fri, 19 Sep 2025 07:18:45 GMT
Title: Evaluating Hallucinations in Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
Authors: Hansol Park, Hoseong Ahn, Junwon Moon, Yejin Lee, Kyuhong Shim
Abstract summary: We investigate how spoken input influences hallucinations in large language models. We present RePOPE-Spk, an audio-augmented extension of the RePOPE benchmark, where queries are provided as speech under diverse acoustic conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written.
Score: 10.361060366260729
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Hallucinations in vision-language models have been extensively studied using benchmarks that probe reliability in image-text settings. In contrast, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice-driven interfaces. In this work, we investigate how spoken input influences hallucinations in multimodal large language models. We present RePOPE-Spk, an audio-augmented extension of the RePOPE benchmark, where queries are provided as speech under diverse acoustic conditions. Using RePOPE-Spk, we systematically evaluate both proprietary and open-source models. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3% under clean speech and by up to 20% with environmental noise. Input order and query length further affect robustness, while strategies such as many-shot prompting and chain-of-thought reasoning offer partial but insufficient mitigation. These findings highlight a critical and underexplored challenge, opening new directions for building reliable voice interface systems.

Related papers

DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs [5.740643252319679]
This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese language models. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%.
arXiv Detail & Related papers (2026-01-08T08:27:47Z)
Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models [49.435669307386156]
Multi-stage Prompt Refinement (MPR) is a framework designed to systematically improve ill-formed prompts across multiple stages. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Results on hallucination benchmarks show that prompts refined by MPR achieve over an 85% win rate compared to their original forms.
arXiv Detail & Related papers (2025-10-14T00:31:36Z)
Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z)
Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge [5.065947993017158]
Large language models (LLMs) have shown remarkable capabilities to generate coherent text. However, they suffer from hallucinations (factually inaccurate statements). We investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries using evidence from three search engines.
arXiv Detail & Related papers (2025-06-24T13:20:31Z)
Towards Long Context Hallucination Detection [49.195854802543714]
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. We propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations.
arXiv Detail & Related papers (2025-04-28T03:47:05Z)
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models [13.48296910438554]
We introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples. We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset. We propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot.
arXiv Detail & Related papers (2024-08-18T10:07:02Z)
Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks. However, they can generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences. We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets. We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
On Hallucination and Predictive Uncertainty in Conditional Language Generation [76.18783678114325]
Higher predictive uncertainty corresponds to a higher chance of hallucination. Epistemic uncertainty is more indicative of hallucination than aleatoric or total uncertainties. The proposed beam search variant helps achieve a better trade-off, giving up some performance on standard metrics in exchange for less hallucination.
arXiv Detail & Related papers (2021-03-28T00:32:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.