Visually Dehallucinative Instruction Generation
- URL: http://arxiv.org/abs/2402.08348v1
- Date: Tue, 13 Feb 2024 10:25:45 GMT
- Title: Visually Dehallucinative Instruction Generation
- Authors: Sungguk Cha, Jusung Lee, Younghyun Lee, Cheoljong Yang
- Abstract summary: This paper presents a novel and scalable method for generating visually dehallucinative instructions, dubbed CAP2QA, which constrains the generated instructions to image contents only.
Experiments show that the proposed method significantly reduces visual hallucination while consistently improving visual recognition ability and expressiveness.
- Score: 0.8192907805418583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, synthetic visual instructions produced by generative language models have demonstrated plausible text generation performance on visual question-answering tasks. However, hallucination remains a challenge for generative language models, i.e., the generated image-text data contain unintended content. This paper presents a novel and scalable method for generating visually dehallucinative instructions, dubbed CAP2QA, that constrains the scope of the generated instructions to image contents. Our key contributions are the image-aligned instructive QA dataset CAP2QA-COCO and its scalable generation recipe. In our experiments, we compare synthetic visual instruction datasets that share the same source data through visual instruction tuning and evaluate on general visual recognition tasks. The results show that our proposed method significantly reduces visual hallucination while consistently improving visual recognition ability and expressiveness.
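As a rough illustration of the caption-constrained generation idea described in the abstract, the sketch below prompts an instruction-following language model to write QA pairs that must be answerable from an image caption alone. The prompt wording, the `caption_to_qa` helper, and the JSON output format are illustrative assumptions, not the authors' actual CAP2QA recipe.

```python
import json
from typing import Callable, Dict, List

# Hypothetical sketch of caption-constrained QA generation (not the authors' code).
# `llm` is any text-completion function, e.g. a wrapper around an instruction-tuned LLM.

PROMPT_TEMPLATE = """You are given an image caption. Write {n} question-answer pairs.
Rules:
- Every question must be answerable using ONLY the caption below.
- Do not mention objects, attributes, or counts that the caption does not state.
Caption: "{caption}"
Return a JSON list of objects with keys "question" and "answer"."""


def caption_to_qa(caption: str, llm: Callable[[str], str], n: int = 3) -> List[Dict[str, str]]:
    """Generate QA pairs whose scope is constrained to the caption content."""
    raw = llm(PROMPT_TEMPLATE.format(caption=caption, n=n))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []  # a real pipeline would retry or filter malformed outputs


# Example with a COCO-style caption (the caption and `my_llm` are placeholders):
# qa = caption_to_qa("Two dogs play with a red frisbee on the grass.", llm=my_llm)
```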
Related papers
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z)
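For the ViGoR entry above, the sketch below gives a generic illustration of what per-sentence ("fine-grained") reward weighting can look like during fine-tuning; the reward source, objective, and function names are assumptions, not ViGoR's actual formulation.

```python
import torch

# Generic illustration of per-sentence reward weighting; the actual ViGoR reward
# sources and training objective are described in the paper, not reproduced here.

def reward_weighted_loss(per_token_logprobs: torch.Tensor,
                         sentence_ids: torch.Tensor,
                         sentence_rewards: torch.Tensor) -> torch.Tensor:
    """per_token_logprobs: (T,) log-probs of generated tokens under the LVLM.
    sentence_ids: (T,) index of the sentence each token belongs to.
    sentence_rewards: (S,) grounding reward for each sentence (e.g. from a scorer)."""
    token_rewards = sentence_rewards[sentence_ids]        # broadcast reward to tokens
    return -(token_rewards * per_token_logprobs).mean()   # reward-weighted NLL
```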
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs.
The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations.
Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families.
arXiv Detail & Related papers (2023-11-28T16:26:35Z)
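The VCD entry above describes contrasting output distributions from original and distorted visual inputs; a minimal sketch of that decoding step is given below, assuming a simple linear contrast with weight `alpha` and omitting VCD's adaptive plausibility constraint.

```python
import torch

# Minimal sketch of contrastive decoding between original and distorted visual inputs,
# following the idea stated in the VCD abstract; the linear contrast and the omission
# of the adaptive plausibility constraint are simplifications, not the full method.

def contrastive_logits(logits_original: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Next-token logits from the same prompt with the clean image and with a distorted
    (e.g. heavily noised) image; the contrast down-weights tokens the model would
    produce even without reliable visual evidence."""
    return (1 + alpha) * logits_original - alpha * logits_distorted

# At each decoding step one would sample / argmax from
#   softmax(contrastive_logits(model(image, prompt), model(distorted_image, prompt)))
```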
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- ORES: Open-vocabulary Responsible Visual Synthesis [104.7572323359984]
We formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts.
To address this problem, we present a Two-stage Intervention (TIN) framework.
By introducing 1) rewriting with a learnable instruction through a large-scale language model (LLM) and 2) synthesizing with prompt intervention on a diffusion model, it can effectively synthesize images that avoid forbidden concepts while following the user's query as closely as possible.
arXiv Detail & Related papers (2023-08-26T06:47:34Z)
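The ORES/TIN entry above describes a rewrite-then-intervene pipeline; the sketch below approximates it with an LLM rewrite and a plain negative prompt on a diffusion model. This is an assumption for illustration only, as the paper's learnable instruction and prompt-intervention scheme are not reproduced here.

```python
from diffusers import StableDiffusionPipeline

# Rough approximation of a two-stage "rewrite then intervene" pipeline; ORES/TIN uses a
# learnable instruction for the rewrite and its own prompt-intervention scheme, so the
# rewrite prompt and the use of a plain negative prompt below are assumptions.

def rewrite_query(query: str, forbidden: list[str], llm) -> str:
    """Stage 1: ask an LLM to restate the query without the forbidden concepts."""
    return llm(f"Rewrite this image request so it avoids {forbidden} "
               f"but keeps the user's intent: {query!r}")

def responsible_synthesis(query: str, forbidden: list[str], llm):
    """Stage 2: generate with the rewritten prompt, discouraging forbidden concepts."""
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    safe_prompt = rewrite_query(query, forbidden, llm)
    return pipe(prompt=safe_prompt, negative_prompt=", ".join(forbidden)).images[0]
```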
- TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation [20.00366398989886]
We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces.
TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words.
Our experiments demonstrate that TextManiA is particularly powerful for scarce samples with class imbalance, as well as for evenly distributed data.
arXiv Detail & Related papers (2023-07-27T03:56:39Z)
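In the spirit of the TextManiA entry above, the sketch below perturbs a visual feature with the difference between text embeddings of an attribute-modified class name and the plain class name; the text encoder, attribute choice, and scaling factor are placeholders rather than the paper's exact recipe.

```python
import torch

# Loose sketch of text-driven feature augmentation: add an intra-class semantic
# perturbation derived from attribute words to a visual feature. Illustrative only.

def text_driven_augment(visual_feat: torch.Tensor,
                        class_name: str,
                        attribute: str,
                        text_encoder,
                        scale: float = 0.1) -> torch.Tensor:
    """visual_feat: (D,) feature of one image; text_encoder maps a string to a (D,) vector."""
    delta = text_encoder(f"{attribute} {class_name}") - text_encoder(class_name)
    return visual_feat + scale * delta  # intra-class semantic perturbation

# e.g. text_driven_augment(feat, "dog", "small white", clip_text_encoder)
```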
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform best, and a smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
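For the CLIP-based look-and-feel entry above, a minimal zero-shot scoring sketch is shown below; the antonym prompt pair and softmax scoring are common CLIP-IQA-style choices and are assumptions, not necessarily the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative zero-shot "look" assessment with antonym prompts; prompt pair and
# checkpoint are assumptions for the sake of a self-contained example.

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def quality_score(image: Image.Image) -> float:
    """Return the probability that the image matches 'a good photo' vs. 'a bad photo'."""
    inputs = processor(text=["a good photo.", "a bad photo."],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

# score = quality_score(Image.open("example.jpg"))
```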
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN improves speed by 39% and adaptively incorporates linguistic information to enhance visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)