CIEM: Contrastive Instruction Evaluation Method for Better Instruction
Tuning
- URL: http://arxiv.org/abs/2309.02301v2
- Date: Fri, 24 Nov 2023 07:07:03 GMT
- Title: CIEM: Contrastive Instruction Evaluation Method for Better Instruction
Tuning
- Authors: Hongyu Hu, Jiyuan Zhang, Minyi Zhao, Zhenbang Sun
- Abstract summary: Vision-Language Models (VLMs) may generate incorrect perceptual information in downstream applications, for example, captioning a non-existent entity.
To address the hallucination phenomenon, we introduce a Contrastive Instruction Evaluation Method (CIEM) and Contrastive Instruction Tuning (CIT).
We pinpoint the hallucination issues commonly present in existing VLMs, the inability of current instruction-tuning datasets to handle the hallucination phenomenon, and the superiority of CIT-tuned VLMs on both CIEM and public datasets.
- Score: 8.217445461627797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on Large Vision-Language Models (LVLMs) has advanced
rapidly thanks to the success of Large Language Models (LLMs). Nevertheless,
these Vision-Language Models (VLMs) suffer from hallucination: owing to an
insufficient understanding of the vision and language modalities, VLMs may
generate incorrect perceptual information in downstream applications, for
example, captioning a non-existent entity. To address the hallucination
phenomenon, on the one hand, we introduce the Contrastive Instruction
Evaluation Method (CIEM), an automatic pipeline that leverages an annotated
image-text dataset coupled with an LLM to generate factual/contrastive
question-answer pairs for evaluating the hallucination of VLMs. On the other
hand, based on CIEM, we further propose a new instruction-tuning method
called CIT (Contrastive Instruction Tuning) to alleviate the hallucination of
VLMs by automatically producing high-quality factual/contrastive
question-answer pairs and corresponding justifications for model tuning.
Through extensive experiments on CIEM and CIT, we pinpoint the hallucination
issues commonly present in existing VLMs, the inability of current
instruction-tuning datasets to handle the hallucination phenomenon, and the
superiority of CIT-tuned VLMs on both CIEM and public datasets.
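The abstract describes CIEM as an automatic pipeline that couples an annotated image-text dataset with an LLM to generate factual/contrastive question-answer pairs, and then scores a VLM by how it answers them. As a rough, hypothetical illustration only (the prompts, the `llm_complete` callable, and all function names below are stand-ins, not the authors' released pipeline), a minimal Python sketch of that generate-then-score loop could look like this:

```python
# Hypothetical sketch of a CIEM-style pipeline: from one annotated caption,
# ask an LLM for a factual yes/no question about an entity that IS mentioned
# (gold answer "yes") and a contrastive question about a plausible entity
# that is NOT mentioned (gold answer "no"). `llm_complete` is a placeholder
# for any text-completion or chat API.
from typing import Callable


def generate_ciem_pairs(caption: str,
                        llm_complete: Callable[[str], str]) -> list[dict]:
    """Return one factual and one contrastive QA pair for a single caption."""
    factual_prompt = (
        "Given the image caption below, write one yes/no question about an "
        "object that IS mentioned, so the correct answer is 'yes'.\n"
        f"Caption: {caption}\nQuestion:"
    )
    contrastive_prompt = (
        "Given the image caption below, write one yes/no question about a "
        "plausible object that is NOT mentioned, so the correct answer is "
        "'no'.\n"
        f"Caption: {caption}\nQuestion:"
    )
    return [
        {"question": llm_complete(factual_prompt).strip(), "answer": "yes"},
        {"question": llm_complete(contrastive_prompt).strip(), "answer": "no"},
    ]


def ciem_accuracy(vlm_answers: list[str], gold_answers: list[str]) -> float:
    """Hallucination evaluation then reduces to accuracy on the yes/no pairs."""
    correct = sum(pred.strip().lower().startswith(gold.lower())
                  for pred, gold in zip(vlm_answers, gold_answers))
    return correct / max(len(gold_answers), 1)
```

For CIT, the same generation step would additionally ask the LLM to justify each answer from the caption, and the resulting question-answer-justification triples would serve as instruction-tuning data; that extension is omitted from the sketch.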
Related papers
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572] (2024-11-19)
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
- Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions [31.637204677787576] (2024-11-13)
We introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding.
KnowAda minimizes hallucinations while preserving high descriptiveness.
Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations.
- Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning [16.883679810267342] (2024-10-16)
This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination.
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863] (2024-04-22)
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
- IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding [37.16880672402059] (2024-02-28)
Over-reliance on linguistic priors has been identified as a key factor leading to hallucinations.
We propose to alleviate this problem by introducing a novel image-biased decoding technique.
Our method derives the next-token probability distribution by contrasting predictions from a conventional LVLM with those of an image-biased LVLM (a generic sketch of this contrastive-decoding idea appears after this list).
- Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853] (2024-02-13)
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249] (2023-10-19)
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183] (2023-09-03)
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453] (2023-05-17)
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
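Several of the related papers above (IBD, and in a token-level form CATCH) mitigate hallucination at decoding time by contrasting an image-grounded prediction with one that leans more heavily on linguistic priors. The sketch below shows only the generic contrastive-decoding arithmetic under that reading; the weighting `alpha`, the NumPy-based interface, and the way the image-biased model is obtained are assumptions for illustration, not the exact formulation of any of these papers.

```python
# Generic contrastive-decoding sketch: boost tokens that the image-biased
# model prefers more strongly than the baseline model does, then renormalize.
import numpy as np


def contrastive_next_token_probs(base_logits: np.ndarray,
                                 image_biased_logits: np.ndarray,
                                 alpha: float = 1.0) -> np.ndarray:
    """Combine two next-token logit vectors over the same vocabulary."""
    # Reward the gap between the image-biased and the baseline predictions.
    logits = image_biased_logits + alpha * (image_biased_logits - base_logits)
    logits = logits - logits.max()      # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()          # softmax over the vocabulary
```

Methods in this family typically also add a plausibility constraint (for example, contrasting only over tokens the baseline already ranks highly) so the subtraction does not surface degenerate tokens; that safeguard is left out of the sketch.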