AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
- URL: http://arxiv.org/abs/2410.18325v1
- Date: Wed, 23 Oct 2024 23:36:06 GMT
- Title: AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
- Authors: Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
- Abstract summary: We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs.
Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities.
Simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.
- Score: 27.430040932849018
- Abstract: Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations, underscoring the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.
Related papers
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.
This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.
Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
- VidHal: Benchmarking Temporal Hallucinations in Vision LLMs [9.392258475822915]
We introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations.
A defining feature of VidHal is the careful creation of captions which represent varying levels of hallucination associated with each video.
We propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent.
arXiv Detail & Related papers (2024-11-25T06:17:23Z)
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio [118.75449542080746]
This paper presents the first systematic investigation of hallucinations in large multimodal models (LMMs).
Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning.
arXiv Detail & Related papers (2024-10-16T17:59:02Z)
- Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions [40.79317187623401]
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP).
LLMs are prone to hallucination, generating plausible yet nonfactual content.
This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval systems.
arXiv Detail & Related papers (2023-11-09T09:25:37Z)
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.