Mitigating Object Hallucination in Large Vision-Language Models via
  Classifier-Free Guidance
- URL: http://arxiv.org/abs/2402.08680v1
- Date: Tue, 13 Feb 2024 18:59:05 GMT
- Title: Mitigating Object Hallucination in Large Vision-Language Models via
  Classifier-Free Guidance
- Authors: Linxi Zhao and Yihe Deng and Weitong Zhang and Quanquan Gu
- Abstract summary: Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
- Score: 56.04768229686853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The advancement of Large Vision-Language Models (LVLMs) has increasingly
highlighted the critical issue of their tendency to hallucinate non-existing
objects in the images. To address this issue, previous works focused on using
specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the
outputs of LVLMs. However, these approaches require either expensive
training/fine-tuning or API access to advanced LLMs to correct the model's
output post-generation. In this paper, we tackle this challenge by introducing
a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE
(MARINE), which is both training-free and API-free, and can effectively and
efficiently reduce object hallucinations during the generation process.
Specifically, MARINE enriches the visual context of LVLMs by integrating
existing open-source vision models, and employs classifier-free guidance to
incorporate the additional object grounding features to improve the precision
of LVLMs' generations. Through comprehensive evaluations across $6$ popular
LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of
MARINE, which even outperforms existing fine-tuning-based methods. Remarkably,
it not only reduces hallucinations but also improves the detailedness of LVLMs'
generations, as assessed by GPT-4V.
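
The guidance step described in the abstract can be illustrated with a small sketch. The abstract does not give MARINE's exact formula, so the code below assumes one common form of classifier-free guidance for next-token prediction: the log-probabilities with and without the extra object-grounding context are blended with a guidance strength `gamma`. The function names, the three-word vocabulary, and the logit values are all hypothetical and only serve the illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_scores(grounded_logits, plain_logits, gamma):
    """Assumed guidance rule (not taken from the abstract):
    gamma * log p(token | grounded) + (1 - gamma) * log p(token | plain).
    gamma = 0 recovers the plain model, gamma = 1 the grounded one,
    and gamma > 1 pushes further toward the grounded prediction.
    """
    lp_grounded = log_softmax(grounded_logits)
    lp_plain = log_softmax(plain_logits)
    return [gamma * g + (1.0 - gamma) * p
            for g, p in zip(lp_grounded, lp_plain)]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy setup: the image shows a dog, but the plain model
# leans toward the non-existent "frisbee".
vocab = ["dog", "frisbee", "bench"]
plain = [1.5, 2.0, 0.5]      # logits without object-grounding context
grounded = [2.0, 1.8, 0.5]   # logits with object-grounding context

for gamma in (0.0, 1.0, 2.0):
    choice = vocab[greedy_token(guided_scores(grounded, plain, gamma))]
    print(f"gamma={gamma}: {choice}")
```

In this toy case the plain model picks "frisbee" while any `gamma` of 1 or more picks "dog". In practice the guidance strength would be tuned, since very large values can distort fluency.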
 
      
Related papers
- Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens [0.0]
 Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities.
LVLMs tend to over-rely on textual prompts and internal knowledge of large language models, generating descriptions inconsistent with visual cues.
We propose a training-free method to mitigate object hallucination.
 arXiv  Detail & Related papers  (2025-08-04T13:40:59Z)
- An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding [7.588486998437453]
 We propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU.
The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions.
Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% on average Bert-Score.
 arXiv  Detail & Related papers  (2025-05-03T08:46:04Z)
- CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base [29.477973983931083]
 We propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs.
At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations.
We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs.
 arXiv  Detail & Related papers  (2025-02-18T07:06:36Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
 Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
 arXiv  Detail & Related papers  (2025-01-03T17:56:28Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
 The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
 arXiv  Detail & Related papers  (2024-12-12T18:55:18Z)
- A Survey of Hallucination in Large Visual Language Models [48.794850395309076]
 The existence of hallucinations has limited the potential and practical effectiveness of LVLM in various fields.
The structure of LVLMs and main causes of hallucination generation are introduced.
The available hallucination evaluation benchmarks for LVLMs are presented.
 arXiv  Detail & Related papers  (2024-10-20T10:58:58Z)
- Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning [16.883679810267342]
 This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination.
 arXiv  Detail & Related papers  (2024-10-16T00:15:40Z)
- Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
 Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities.
We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do.
We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
 arXiv  Detail & Related papers  (2024-10-03T23:40:21Z)
- CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs [37.98496239547762]
 Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment.
We present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs.
 arXiv  Detail & Related papers  (2024-08-19T21:56:20Z)
- Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD) [13.430637580980164]
 Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities.
Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on Large Language Models distribution confidence levels.
Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models.
 arXiv  Detail & Related papers  (2024-08-06T08:10:34Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
 Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
 arXiv  Detail & Related papers  (2024-03-08T12:35:07Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
 In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
 arXiv  Detail & Related papers  (2024-02-26T05:43:51Z)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
 We propose a simple yet effective training strategy MoE-Tuning for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers.
Experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks.
 arXiv  Detail & Related papers  (2024-01-29T08:13:40Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
 We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
 arXiv  Detail & Related papers  (2023-12-05T07:29:14Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
 This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
 arXiv  Detail & Related papers  (2023-12-03T16:39:36Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
 This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
 arXiv  Detail & Related papers  (2023-10-19T06:37:32Z)
- CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning [8.217445461627797]
 Vision-Language Models (VLMs) may generate incorrect perception information when doing downstream applications, for example, captioning a non-existent entity.
To address the hallucination phenomenon, we introduce a Contrastive Instruction Evaluation Method (CIEM) and Contrastive Instruction Tuning (CIT).
We pinpoint the hallucination issues commonly present in existing VLMs, the inability of the current instruction-tuning dataset to handle the hallucination phenomenon, and the superiority of CIT-tuned VLMs over both CIEM and public datasets.
 arXiv  Detail & Related papers  (2023-09-05T15:06:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     