Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)
- URL: http://arxiv.org/abs/2408.04664v1
- Date: Tue, 6 Aug 2024 08:10:34 GMT
- Title: Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)
- Authors: Avshalom Manevich, Reut Tsarfaty,
- Abstract summary: Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities.
Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on Large Language Models distribution confidence levels.
Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models.
- Score: 13.430637580980164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to %4 improvement in POPE F1 scores and up to %36 reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs [14.381188702947949]
Large Vision-Language Models (LVLMs) primarily align image features of vision encoder with Large Language Models (LLMs) to leverage their superior text generation capabilities.
This imbalance in LVLMs may result in the instances of hallucinatory.
We introduce a training-free algorithm to find an equilibrium point between image comprehension and language inference.
arXiv Detail & Related papers (2024-07-31T17:46:57Z) - Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation [28.648494997132925]
We propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free Sign Language Translation (SLT)
We factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder.
In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential.
arXiv Detail & Related papers (2024-03-19T09:00:23Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE)
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
arXiv Detail & Related papers (2024-02-13T18:59:05Z) - Aligned with LLM: a new multi-modal training paradigm for encoding fMRI
activity in visual cortex [4.57590454144072]
Recently, there has been a surge in the popularity of pre trained large language models (LLMs)
This paper proposes a new multi-modal training paradigm, aligning with LLM, encoding fMRI activity in visual cortex.
arXiv Detail & Related papers (2024-01-08T12:30:23Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large
Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs)
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.