NeIn: Telling What You Don't Want
- URL: http://arxiv.org/abs/2409.06481v1
- Date: Mon, 9 Sep 2024 04:54:34 GMT
- Title: NeIn: Telling What You Don't Want
- Authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Quoc-Huy Trinh, Minh-Triet Tran, Truong Nguyen, Susan Gauch,
- Abstract summary: Negation is a fundamental linguistic concept used by humans to convey information that they do not desire.
One barrier to achieving human-level intelligence is the lack of a standard collection by which research into negation can be evaluated.
This paper presents the first large-scale dataset, Negative Instruction (NeIn), for studying negation within the vision-language domain.
- Score: 6.666707176043472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Negation is a fundamental linguistic concept used by humans to convey information that they do not desire. Despite this, there has been minimal research specifically focused on negation within vision-language tasks. This lack of research means that vision-language models (VLMs) may struggle to understand negation, implying that they struggle to provide accurate results. One barrier to achieving human-level intelligence is the lack of a standard collection by which research into negation can be evaluated. This paper presents the first large-scale dataset, Negative Instruction (NeIn), for studying negation within the vision-language domain. Our dataset comprises 530,694 quadruples, i.e., source image, original caption, negative sentence, and target image in total, including 495,694 queries for training and 35,000 queries for benchmarking across multiple vision-language tasks. Specifically, we automatically generate NeIn based on a large, existing vision-language dataset, MS-COCO, via two steps: generation and filtering. During the generation phase, we leverage two VLMs, BLIP and MagicBrush, to generate the target image and a negative clause that expresses the content of the source image. In the subsequent filtering phase, we apply BLIP to remove erroneous samples. Additionally, we introduce an evaluation protocol for negation understanding of image editing models. Extensive experiments using our dataset across multiple VLMs for instruction-based image editing tasks demonstrate that even recent state-of-the-art VLMs struggle to understand negative queries. The project page is: https://tanbuinhat.github.io/NeIn/
Related papers
- Vision-Language Models Do Not Understand Negation [50.27667000027403]
NegBench is a benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets.
We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
arXiv Detail & Related papers (2025-01-16T09:55:42Z) - Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation [63.064204206220936]
Foundational Large Language Models (LLMs) have changed the way we perceive technology.
They have been shown to excel in tasks ranging from poem writing to coding to essay generation and puzzle solving.
With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools.
Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content.
arXiv Detail & Related papers (2024-08-27T14:40:16Z) - How and where does CLIP process negation? [2.5600000778964294]
We build on the existence task from the VALSE benchmark to test models' understanding of negation.
We take inspiration from the literature on model interpretability to explain the behaviour of VL models on the understanding of negation.
arXiv Detail & Related papers (2024-07-15T07:20:06Z) - Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large-language-models to generate negative text descriptions, and text-to-image diffusion models to also generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z) - Enhancing Multimodal Compositional Reasoning of Visual Language Models
with Generative Negative Mining [58.379339799777064]
Large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks.
We propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities.
Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
arXiv Detail & Related papers (2023-11-07T13:05:47Z) - Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study $textitgenerative VLMs$ that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z) - Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.