Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
- URL: http://arxiv.org/abs/2406.08402v1
- Date: Wed, 12 Jun 2024 16:51:54 GMT
- Title: Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
- Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee
- Abstract summary: We introduce methods to assess the extent of object hallucination in publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
- Score: 49.87432626548563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large audio-language models (LALMs) enhance traditional large language models by integrating audio perception capabilities, allowing them to tackle audio-related tasks. Previous research has primarily focused on assessing the performance of LALMs across various tasks while overlooking their reliability, particularly concerning issues like object hallucination. In our study, we introduce methods to assess the extent of object hallucination in publicly available LALMs. Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content, but they struggle to answer discriminative questions, specifically those requiring the identification of the presence of particular object sounds within an audio clip. This limitation highlights a critical weakness in current LALMs: their inadequate understanding of discriminative queries. Moreover, we explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
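The probing setup the abstract describes (asking yes/no questions about whether a particular object sound is present, then measuring how often the model affirms sounds that are absent) is straightforward to make concrete. Below is a minimal sketch under stated assumptions: `query_lalm` is a hypothetical stand-in for an actual LALM call, and the prompt wording, data format, and metric are illustrative rather than the authors' exact protocol.

```python
from typing import Callable, List, Tuple

def make_prompt(obj: str) -> str:
    # A discriminative (yes/no) question about one object sound.
    return f"Is there a sound of a {obj} in the audio? Answer yes or no."

def hallucination_rate(
    query_lalm: Callable[[str, str], str],
    samples: List[Tuple[str, List[str]]],
) -> float:
    """Fraction of absent-object probes the model answers 'yes' to.

    Each sample is (audio_path, absent_objects), where absent_objects
    are plausible distractor sounds that do NOT occur in the clip.
    """
    false_positives = probes = 0
    for audio_path, absent in samples:
        for obj in absent:
            answer = query_lalm(audio_path, make_prompt(obj)).strip().lower()
            probes += 1
            false_positives += answer.startswith("yes")
    return false_positives / max(probes, 1)

if __name__ == "__main__":
    # Dummy stand-in model that always answers "yes" -- exactly the
    # over-affirmative behavior such a probe is designed to expose.
    always_yes = lambda audio_path, prompt: "Yes, I can hear it."
    data = [("clip1.wav", ["car horn", "siren", "dog bark"])]
    print(f"hallucination rate: {hallucination_rate(always_yes, data):.2f}")
```

A prompt-engineering variant of the kind the abstract explores would only change `make_prompt`, for example by first asking the model to enumerate the sounds it hears before committing to a yes/no answer.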
Related papers
- Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak [35.62727804915181]
This paper investigates how audio-specific edits influence the inference of Large Audio-Language Models (LALMs) in jailbreak scenarios.
We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection.
We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits.
arXiv Detail & Related papers (2025-01-23T15:51:38Z) - Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans.
Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans.
We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding.
arXiv Detail & Related papers (2024-12-06T16:34:15Z) - AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
Multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.
Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.
We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z) - AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models [27.430040932849018]
We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs.
Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities.
Simple training with our AVHBench improves the robustness of audio-visual LLMs against hallucinations.
arXiv Detail & Related papers (2024-10-23T23:36:06Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning.
LALMs excel in general audio understanding, but are limited in temporal reasoning.
This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z) - Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method that dynamically utilizes supporting documents with our judgement strategy.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)