Related papers: VisMin: Visual Minimal-Change Understanding

VisMin: Visual Minimal-Change Understanding

URL: http://arxiv.org/abs/2407.16772v2
Date: Wed, 22 Jan 2025 17:42:29 GMT
Title: VisMin: Visual Minimal-Change Understanding
Authors: Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal,
Abstract summary: We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin)<n>VisMin requires models to predict the correct image-caption match given two images and two captions.<n>We build an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators.
Score: 7.226130826257802
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.

Related papers

Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models.<n>We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences.<n>We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
How Can Objects Help Video-Language Understanding? [16.63183488540909]
We introduce ObjectML, a framework capable of leveraging arbitrary computer vision algorithm to extract and structured visual representation.<n>Through extensive evaluations on six video question benchmarks, we confirm that explicit integration of object-centric representation remains necessary.<n>Surprisingly, we observe that the simple approach quantizing the continuous, structured object information and representing them as plain text performs the best.
arXiv Detail & Related papers (2025-04-10T04:59:28Z)
TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce T, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image. We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN) The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. We parse objects and attributes from the description, which are highly likely to exist in the image. Experiments substantiate the average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z)
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification. An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Leaning (VIL) A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Most existing VG datasets are constructed using simple description texts. We propose a novel benchmark of underlineScene underlineKnowledge-guided underlineVisual underlineGrounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data. It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval [30.429340065755436]
We devise a new model-agnostic formulation for fine-grained cross-modal alignment. Inspired by optimal transport theory, we introduce emphTokenFlow, an instantiation of the proposed scheme.
arXiv Detail & Related papers (2022-09-28T04:11:05Z)
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces. We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA) Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario. We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.