Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
- URL: http://arxiv.org/abs/2406.08402v1
- Date: Wed, 12 Jun 2024 16:51:54 GMT
- Title: Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
- Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee
- Abstract summary: We introduce methods to assess the extent of object hallucination in publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
- Score: 49.87432626548563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks while overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to assess the extent of object hallucination in publicly available LALMs. Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content, but they struggle to answer discriminative questions, specifically those requiring the identification of the presence of particular object sounds within an audio clip. This limitation highlights a critical weakness in current LALMs: their inadequate understanding of discriminative queries. Moreover, we explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
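The probing setup the abstract describes (asking yes/no questions about whether a particular object sound is present, then measuring how often the model affirms sounds that are absent) is straightforward to make concrete. Below is a minimal sketch under stated assumptions: `query_lalm` is a hypothetical stand-in for an actual LALM call, and the prompt wording, data format, and metric are illustrative rather than the authors' exact protocol.

```python
from typing import Callable, List, Tuple

def make_prompt(obj: str) -> str:
    # A discriminative (yes/no) question about one object sound.
    return f"Is there a sound of a {obj} in the audio? Answer yes or no."

def hallucination_rate(
    query_lalm: Callable[[str, str], str],
    samples: List[Tuple[str, List[str]]],
) -> float:
    """Fraction of absent-object probes the model answers 'yes' to.

    Each sample is (audio_path, absent_objects), where absent_objects
    are plausible distractor sounds that do NOT occur in the clip.
    """
    false_positives = probes = 0
    for audio_path, absent in samples:
        for obj in absent:
            answer = query_lalm(audio_path, make_prompt(obj)).strip().lower()
            probes += 1
            false_positives += answer.startswith("yes")
    return false_positives / max(probes, 1)

if __name__ == "__main__":
    # Dummy stand-in model that always answers "yes" -- exactly the
    # over-affirmative behavior such a probe is designed to expose.
    always_yes = lambda audio_path, prompt: "Yes, I can hear it."
    data = [("clip1.wav", ["car horn", "siren", "dog bark"])]
    print(f"hallucination rate: {hallucination_rate(always_yes, data):.2f}")
```

A prompt-engineering variant of the kind the abstract explores would only change `make_prompt`, for example by first asking the model to enumerate the sounds it hears before committing to a yes/no answer.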
Related papers
- Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak [35.62727804915181]
This paper investigates how audio-specific edits influence the inference of Large Audio-Language Models (LALMs) in jailbreak scenarios.
We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection.
We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits.
arXiv Detail & Related papers (2025-01-23T15:51:38Z) - Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans.
Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans.
We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding.
arXiv Detail & Related papers (2024-12-06T16:34:15Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.
Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.
We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models [27.430040932849018]
We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs.
Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities.
Simple training with our AVHBench improves the robustness of audio-visual LLMs against hallucinations.
arXiv Detail & Related papers (2024-10-23T23:36:06Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning.
LALMs excel in general audio understanding, but are limited in temporal reasoning.
This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z) - Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method that dynamically utilizes supporting documents with our judgement strategy.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)