Related papers: RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

URL: http://arxiv.org/abs/2405.17821v1
Date: Tue, 28 May 2024 04:41:02 GMT
Title: RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs
Authors: Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, Changick Kim,
Abstract summary: We propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs. Our approach employs random image transformations as complements to the original probability distribution. Our empirical results show that while the isolated use of transformed images initially degrades performance, strategic implementation of these transformations can indeed serve as effective complements.
Score: 16.185253476874006
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs. Despite their impressive capabilities, they often produce "hallucinatory" outputs that do not accurately reflect the visual information, posing challenges in reliability and trustworthiness. Current methods such as contrastive decoding have made strides in addressing these issues by contrasting the original probability distribution of generated tokens with distorted counterparts; yet, generating visually-faithful outputs remains a challenge. In this work, we shift our focus to the opposite: What could serve as a complementary enhancement to the original probability distribution? We propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs. Our approach employs random image transformations as complements to the original probability distribution, aiming to mitigate the likelihood of hallucinatory visual explanations by enriching the model's exposure to varied visual scenarios. Our empirical results show that while the isolated use of transformed images initially degrades performance, strategic implementation of these transformations can indeed serve as effective complements. Notably, our method is compatible with current contrastive decoding methods and does not require external models or costly self-feedback mechanisms, making it a practical addition. In experiments, RITUAL significantly outperforms existing contrastive decoding methods across several object hallucination benchmarks, including POPE, CHAIR, and MME.

Related papers

Exploring and Mitigating Fawning Hallucinations in Large Language Models [52.444712272909435]
We analyze fawning hallucinations in various natural language processing tasks.<n>We tailor the so-termed contrastive decoding method for fawning-hallucination mitigation.
arXiv Detail & Related papers (2025-08-31T14:29:54Z)
Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks.<n>Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate.<n>In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations.<n>We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy.<n>We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
OViP: Online Vision-Language Preference Learning [26.54737360667123]
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs.<n>We propose an Online Vision-language Preference Learning framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs.<n>Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
arXiv Detail & Related papers (2025-05-21T19:26:09Z)
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning. We propose a novel hallucination-mitigation method from the visual encoding perspective: textbfVisutextbfal textbfLayer Fustextbfion Contrastive textbfDecoding (VaLiD)
arXiv Detail & Related papers (2024-11-24T13:42:02Z)
ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models [11.75855265467876]
We introduce ConVis, a training-free contrastive decoding method. Our experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs.
arXiv Detail & Related papers (2024-08-25T18:02:36Z)
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models [51.70129969269271]
We introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE) Our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs.
arXiv Detail & Related papers (2024-06-04T03:04:21Z)
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization [123.54980913741828]
Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data. They invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information. However, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations.
arXiv Detail & Related papers (2024-05-24T08:46:31Z)
Fact :Teaching MLLMs with Faithful, Concise and Transferable Rationales [102.54274021830207]
We introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. We filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Our approach also reduces hallucinations owing to its high correlation between images and text.
arXiv Detail & Related papers (2024-04-17T07:20:56Z)
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding [25.489832294197797]
This paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules.
arXiv Detail & Related papers (2024-03-27T16:04:47Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding [37.16880672402059]
Over-reliance on linguistic priors has been identified as a key factor leading to hallucinations. We propose to alleviate this problem by introducing a novel image-biased decoding technique. Our method derives the next-token probability distribution by contrasting predictions from a conventional LVLM with those of an image-biased LVLM.
arXiv Detail & Related papers (2024-02-28T16:57:22Z)
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios. We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families.
arXiv Detail & Related papers (2023-11-28T16:26:35Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Learning from Multi-Perception Features for Real-Word Image Super-resolution [87.71135803794519]
We propose a novel SR method called MPF-Net that leverages multiple perceptual features of input images. Our method incorporates a Multi-Perception Feature Extraction (MPFE) module to extract diverse perceptual information. We also introduce a contrastive regularization term (CR) that improves the model's learning capability.
arXiv Detail & Related papers (2023-05-26T07:35:49Z)
VALHALLA: Visual Hallucination for Machine Translation [64.86515924691899]
We introduce a visual hallucination framework, called VALHALLA. It requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation. In particular, given a source sentence an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text.
arXiv Detail & Related papers (2022-05-31T20:25:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.