SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
- URL: http://arxiv.org/abs/2406.11171v2
- Date: Wed, 19 Jun 2024 00:03:42 GMT
- Title: SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
- Authors: Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad,
- Abstract summary: We introduce the SUGARCREPE++ dataset to analyze the sensitivity of vision-and-language models to semantic alterations.
We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++.
- Score: 13.608653575298183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark the performance of SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community.
Related papers
- Improving vision-language alignment with graph spiking hybrid Networks [6.707524980629404]
This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate fine-grained semantic features.
We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z) - Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models [18.87130615326443]
Vision-language models (VLMs) serve as foundation models for image captioning and text-to-image generation.
Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding.
arXiv Detail & Related papers (2024-12-11T05:37:04Z) - HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE)
HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs.
To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z) - MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model Semantic through discrete -language for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations [13.608653575298183]
This paper introduces the VISLA benchmark, designed to evaluate the semantic and lexical understanding of language models.
An evaluation involving 34 vision-language models (VLMs) and 20 unimodal language models (ULMs) reveals surprising difficulties in distinguishing between lexical and semantic variations.
arXiv Detail & Related papers (2024-04-25T07:08:00Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC)
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Visualizing Linguistic Diversity of Text Datasets Synthesized by Large
Language Models [9.808214545408541]
LinguisticLens is a novel inter-active visualization tool for making sense of and analyzing syntactic diversity of datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Meta-Learning with Variational Semantic Memory for Word Sense
Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.