Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
- URL: http://arxiv.org/abs/2503.14895v1
- Date: Wed, 19 Mar 2025 04:39:45 GMT
- Title: Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
- Authors: Shuo Li, Jiajun Sun, Guodong Zheng, Xiaoran Fan, Yujiong Shen, Yi Lu, Zhiheng Xi, Yuming Yang, Wenming Tan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Multimodal large language models (MLLMs) have demonstrated remarkable performance in visual tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects.
- Score: 44.83933994734478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
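The abstract describes MFP only at a high level. As a rough illustration of the kind of frequency-domain manipulation involved, the sketch below splits an image into low- and high-frequency bands with a 2D FFT and damps each band before the image reaches the visual encoder. The masking radius, scaling factors, and the point at which the perturbation is applied are assumptions, not the authors' implementation.
```python
# Illustrative sketch of frequency-domain image perturbation; NOT the authors' exact MFP
# implementation (masking radius, scaling factors, and insertion point are assumptions).
import torch


def frequency_perturb(image: torch.Tensor, radius: int = 16,
                      low_scale: float = 0.9, high_scale: float = 0.9) -> torch.Tensor:
    """Split an image into low/high-frequency bands with a 2D FFT and damp each band.

    image: float tensor of shape (C, H, W) with values in [0, 1].
    radius: half-width of the centered square treated as the "low-frequency" band.
    """
    _, h, w = image.shape
    spec = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))

    # Centered square mask selects low frequencies; its complement is high frequencies.
    mask = torch.zeros(h, w, dtype=torch.bool)
    cy, cx = h // 2, w // 2
    mask[cy - radius:cy + radius, cx - radius:cx + radius] = True

    scale = torch.full((h, w), high_scale)
    scale[mask] = low_scale
    spec = spec * scale  # broadcast over channels; damp each band independently

    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
    return out.clamp(0.0, 1.0)


if __name__ == "__main__":
    img = torch.rand(3, 224, 224)
    perturbed = frequency_perturb(img, radius=16, low_scale=0.8, high_scale=1.0)
    print(perturbed.shape)  # torch.Size([3, 224, 224])
```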
Related papers
- Stop learning it all to mitigate visual hallucination, Focus on the hallucination target [0.10571493942475592]
Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues. These hallucinations undermine model reliability in practical applications. The proposed method is a preference learning approach that mitigates hallucinations by focusing on targeted areas.
arXiv Detail & Related papers (2025-06-13T02:35:03Z) - Mitigating Object Hallucination via Robust Local Perception Search [11.570368427723961]
Local Perception Search (LPS) is a decoding method applied during inference that is both simple and training-free, yet effectively suppresses hallucinations. We show that LPS significantly reduces the incidence of hallucinations compared to the baseline, with particularly strong performance in noisy settings.
arXiv Detail & Related papers (2025-06-07T09:27:26Z) - MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
Multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations.
arXiv Detail & Related papers (2025-05-30T05:54:36Z) - Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering [83.63437999696954]
Hallucination in multimodal large language models (MLLMs) persists as a significant and under-addressed challenge in the video domain. We propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules.
arXiv Detail & Related papers (2025-05-19T08:12:06Z) - Robust Hallucination Detection in LLMs via Adaptive Token Selection [25.21763722332831]
Hallucinations in large language models (LLMs) pose significant safety concerns that impede their broader deployment.
We propose HaMI, a novel approach that enables robust detection of hallucinations through adaptive selection and learning of critical tokens.
We achieve this robustness through an innovative formulation of Hallucination detection as Multiple Instance learning (HaMI) over token-level representations within a sequence.
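As a rough illustration of the multiple-instance framing described above, the sketch below scores each token, adaptively keeps the top-k highest-scoring ("critical") tokens, and pools them into a sequence-level prediction. The scoring head, the value of k, and the pooling rule are assumptions rather than HaMI's actual design.
```python
# Minimal sketch of multiple-instance learning over token representations (scorer, top-k
# selection, and pooling are assumptions, not the HaMI architecture).
import torch
import torch.nn as nn


class TokenMILDetector(nn.Module):
    def __init__(self, hidden_dim: int, top_k: int = 4):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # per-token hallucination score
        self.top_k = top_k

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        """token_states: (batch, seq_len, hidden_dim) hidden states from the LLM.

        Returns a per-sequence hallucination logit by pooling the k highest-scoring
        ("critical") tokens, so the bag label is driven by its most suspicious instances.
        """
        scores = self.scorer(token_states).squeeze(-1)    # (batch, seq_len)
        top_scores, _ = scores.topk(self.top_k, dim=-1)   # adaptive token selection
        return top_scores.mean(dim=-1)                    # (batch,) sequence-level logit


if __name__ == "__main__":
    detector = TokenMILDetector(hidden_dim=4096, top_k=4)
    states = torch.randn(2, 32, 4096)                     # placeholder LLM hidden states
    logits = detector(states)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.tensor([1.0, 0.0]))                 # bag-level (sequence) labels
    print(logits.shape, loss.item())
```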
arXiv Detail & Related papers (2025-04-10T15:39:10Z) - Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost.
Our approach is motivated by the key observation that the MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens.
Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM's reliance on language priors.
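A minimal sketch of this recycle-and-reallocate idea: cap the attention mass that the current decoding step spends on non-visual (historical output) tokens and move the surplus onto visual tokens. The budget value and the proportional redistribution rule are assumptions, not AttnReal's exact mechanism.
```python
# Illustrative attention reallocation in the spirit of AttnReal; threshold and redistribution
# rule are assumptions.
import torch


def reallocate_attention(attn: torch.Tensor, visual_mask: torch.Tensor,
                         output_budget: float = 0.5) -> torch.Tensor:
    """attn: (seq_len,) attention weights of the current decoding step (sums to 1).
    visual_mask: (seq_len,) bool, True at visual-token positions.
    Caps the mass on non-visual tokens at `output_budget` and recycles the surplus onto
    visual tokens, proportionally to their existing attention."""
    text_mass = attn[~visual_mask].sum()
    surplus = torch.clamp(text_mass - output_budget, min=0.0)
    if surplus <= 0 or visual_mask.sum() == 0:
        return attn
    new_attn = attn.clone()
    new_attn[~visual_mask] *= (text_mass - surplus) / text_mass            # shrink text attention
    vis = attn[visual_mask]
    new_attn[visual_mask] = vis + surplus * vis / vis.sum().clamp(min=1e-8)  # recycle surplus
    return new_attn / new_attn.sum()                                        # keep a valid distribution


if __name__ == "__main__":
    attn = torch.tensor([0.05, 0.05, 0.10, 0.30, 0.50])    # last two are prior output tokens
    visual = torch.tensor([True, True, True, False, False])
    print(reallocate_attention(attn, visual, output_budget=0.4))
```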
arXiv Detail & Related papers (2025-03-11T11:52:37Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z) - ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models [15.156359255401812]
We propose ODE, an open-set, dynamic protocol to evaluate hallucinations in multimodal large language models (MLLMs). ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. It generates varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks.
arXiv Detail & Related papers (2024-09-14T05:31:29Z) - Unified Hallucination Detection for Multimodal Large Language Models [44.333451078750954]
Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination.
We present a novel meta-evaluation benchmark, MHaluBench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods.
We unveil a novel unified multimodal hallucination detection framework, UNIHD, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly.
arXiv Detail & Related papers (2024-02-05T16:56:11Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs.
We propose a unique mechanism that decomposes on-demand event queries into iconic actions.
We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
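As a hedged sketch of grounding one decomposed action to a timestamp, the snippet below scores candidate video frames against an action description with CLIP and returns the best-matching frame's timestamp. The checkpoint choice and the simple argmax selection are assumptions; the paper's full pipeline (including BLIP2) is more involved.
```python
# Sketch of localizing an "iconic action" via CLIP frame-text similarity; checkpoint and
# argmax selection are assumptions, not the paper's exact procedure.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def locate_action(frames: list[Image.Image], timestamps: list[float], action: str) -> float:
    """Return the timestamp of the frame that best matches the action description."""
    inputs = processor(text=[action], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,) similarities
    return timestamps[int(sims.argmax())]

# Example usage: locate_action(frames, [0.0, 0.5, 1.0], "a person opens a door")
```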
arXiv Detail & Related papers (2024-01-18T10:18:48Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
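The self-contradiction idea can be sketched as follows: sample several answers to the same question and flag a likely hallucination when an NLI model finds them mutually contradictory. The NLI checkpoint, sampling strategy, and max-over-pairs scoring below are assumptions, not AutoHall's exact procedure.
```python
# Hedged sketch of self-contradiction-based hallucination detection (NLI model choice and
# scoring rule are assumptions).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)


def contradiction_prob(premise: str, hypothesis: str) -> float:
    """Probability that `hypothesis` contradicts `premise` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    labels = {v.lower(): k for k, v in nli_model.config.id2label.items()}
    return probs[labels["contradiction"]].item()


def self_contradiction_score(answers: list[str]) -> float:
    """Flag likely hallucination when independently sampled answers contradict each other."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    return max(contradiction_prob(a, b) for a, b in pairs) if pairs else 0.0
```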
arXiv Detail & Related papers (2023-09-30T05:20:02Z) - Dynamic Spectrum Mixer for Visual Recognition [17.180863898764194]
We propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
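As an illustrative sketch of cosine-transform token mixing, the snippet below maps token features to the frequency domain along the sequence axis, scales each frequency with a gain, and maps back. Generating those gains dynamically from the content (the "dynamic" part of DSM) is omitted, and all shapes and weights here are assumptions.
```python
# Minimal sketch of a cosine-transform token mixer in the spirit of DSM; the static
# per-frequency gains are an illustrative simplification, not the paper's architecture.
import numpy as np
from scipy.fft import dct, idct


def spectrum_mix(tokens: np.ndarray, freq_weights: np.ndarray) -> np.ndarray:
    """tokens: (seq_len, dim) token features; freq_weights: (seq_len,) per-frequency gains.

    Mixes tokens globally by scaling their DCT spectrum along the sequence axis, so every
    output token depends on all input tokens without an explicit attention matrix."""
    spec = dct(tokens, type=2, norm="ortho", axis=0)   # sequence -> frequency domain
    spec *= freq_weights[:, None]                      # gate each frequency (static sketch)
    return idct(spec, type=2, norm="ortho", axis=0)    # back to the token domain


if __name__ == "__main__":
    x = np.random.randn(16, 64).astype(np.float32)
    w = np.ones(16, dtype=np.float32)                  # identity gains reproduce the input
    y = spectrum_mix(x, w)
    print(np.allclose(x, y, atol=1e-5))                # True
```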
arXiv Detail & Related papers (2023-09-13T04:51:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.