Decoupled Global-Local Alignment for Improving Compositional Understanding
- URL: http://arxiv.org/abs/2504.16801v1
- Date: Wed, 23 Apr 2025 15:20:53 GMT
- Title: Decoupled Global-Local Alignment for Improving Compositional Understanding
- Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts. We introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities.
- Score: 22.14543435592755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This self-distillation constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
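To make the abstract's two main ingredients concrete, the sketch below shows (a) an exponential-moving-average teacher for self-distillation and (b) an InfoNCE-style loss with explicit hard negative captions. The module boundaries, loss forms, and hyperparameters are illustrative assumptions based only on the abstract; the authors' actual IGC/TGC formulations may differ (see the linked repository for the released code).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Frozen teacher tracks an exponential moving average of the student's weights.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)

def self_distill_loss(student_emb, teacher_emb):
    # Global alignment constraint: keep student embeddings close to the EMA teacher's
    # to limit catastrophic forgetting of pretrained knowledge.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1).mean()

def hard_negative_infonce(anchor, positive, negatives, tau=0.07):
    # anchor: (B, D), positive: (B, D), negatives: (B, K, D), e.g. K negative captions.
    anchor, positive = F.normalize(anchor, dim=-1), F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)        # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives)    # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

# One plausible reading of the local losses (shapes only, not the released code):
#   igc = hard_negative_infonce(img_emb, txt_emb, neg_txt_emb)   # image-grounded contrast
#   tgc = hard_negative_infonce(txt_emb, img_emb, neg_txt_emb)   # text-grounded contrast
```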
Related papers
- Synthetic Data is an Elegant GIFT for Continual Vision-Language Models [52.343627275005026]
GIFT is a novel continual fine-tuning approach to overcome catastrophic forgetting in Vision-Language Models. We employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. Our method consistently outperforms previous state-of-the-art approaches across various settings.
arXiv Detail & Related papers (2025-03-06T09:09:18Z)
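A hedged sketch of the replay-and-distill idea the GIFT summary above describes: synthesize pre-training-like images with an off-the-shelf diffusion model, then keep the fine-tuned model close to a frozen copy of the pre-trained model on those images. The checkpoint id, the CLIP-style encode_image method, and the cosine distillation loss are placeholders, not GIFT's actual recipe.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

# Any text-to-image checkpoint works for illustration; image preprocessing is omitted.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

def synthesize_replay_images(prompts):
    # Recreate pre-training-like and downstream-like images from captions,
    # so no original training data needs to be stored.
    return [pipe(p).images[0] for p in prompts]

def replay_distill_loss(student, frozen_pretrained, image_batch):
    # image_batch: preprocessed tensor of synthetic images, shape (B, 3, H, W).
    with torch.no_grad():
        target = frozen_pretrained.encode_image(image_batch)
    pred = student.encode_image(image_batch)
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```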
- Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms.
PointACL is an attention-driven contrastive learning framework designed to address these limitations.
Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
arXiv Detail & Related papers (2024-11-22T05:41:00Z)
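The PointACL entry above hinges on an "attention-driven dynamic masking strategy" that steers the model toward under-attended regions. Below is one guess at what that could look like, assuming access to a transformer block's self-attention weights; whether the highly- or weakly-attended tokens are hidden, the mask ratio, and the batch-level InfoNCE loss are all assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def attention_driven_mask(attn, mask_ratio=0.3):
    # attn: (B, heads, N, N) self-attention weights over N point tokens.
    # Score each token by the attention it receives (mean over heads and queries) and
    # hide the MOST-attended tokens, so the network must exploit the under-attended
    # regions that remain visible.
    received = attn.mean(dim=1).mean(dim=1)              # (B, N)
    k = int(mask_ratio * received.size(1))
    idx = received.topk(k, largest=True).indices         # highly-attended tokens to hide
    keep = torch.ones_like(received, dtype=torch.bool)
    keep.scatter_(1, idx, False)                         # False = masked out
    return keep

def batch_infonce(z_full, z_masked, tau=0.1):
    # Pull each cloud's full-view feature toward its masked-view feature (diagonal positives).
    z1, z2 = F.normalize(z_full, dim=-1), F.normalize(z_masked, dim=-1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```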
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
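Since the TripletCLIP summary above pairs each image-caption example with a synthetic negative caption and a matching negative image, a natural objective contrasts the true pair against that hard pair in both directions. The sketch below is one such formulation under stated assumptions (pre-computed embeddings, a two-way softmax per anchor); it is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def triplet_style_clip_loss(img, txt, neg_img, neg_txt, tau=0.07):
    # img/txt: embeddings of matched pairs; neg_img/neg_txt: embeddings of the
    # synthetic negative image and its hard negative caption. All shapes (B, D).
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    neg_img, neg_txt = F.normalize(neg_img, dim=-1), F.normalize(neg_txt, dim=-1)

    def two_way_nce(anchor, pos, neg):
        # Each anchor must prefer its true partner over the hard negative.
        logits = torch.stack([(anchor * pos).sum(-1), (anchor * neg).sum(-1)], dim=1) / tau
        labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, labels)

    # Image anchored against (true caption vs. negative caption), and
    # text anchored against (true image vs. negative image).
    return 0.5 * (two_way_nce(img, txt, neg_txt) + two_way_nce(txt, img, neg_img))
```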
- Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations [43.484570564890866]
Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt.
We present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions.
Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations.
arXiv Detail & Related papers (2024-03-29T17:33:42Z)
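A minimal sketch of how negated captions like those in CC-Neg can be folded into a CLIP-style contrastive loss: the negated caption of each image is simply appended to the text side as an extra (wrong) candidate. This is a generic construction consistent with the summary above, not necessarily the exact modification used by CoN-CLIP.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_negated_captions(img, txt, neg_txt, tau=0.07):
    # img, txt, neg_txt: (B, D) embeddings; neg_txt encodes each image's *negated*
    # caption (e.g. "a dog without a ball" for "a dog with a ball").
    img = F.normalize(img, dim=-1)
    all_txt = F.normalize(torch.cat([txt, neg_txt], dim=0), dim=-1)   # (2B, D)
    logits = img @ all_txt.t() / tau                                  # (B, 2B)
    labels = torch.arange(img.size(0), device=img.device)             # true captions sit in columns 0..B-1
    return F.cross_entropy(logits, labels)
```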
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
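The entry above turns compact (head, relation, tail) triplets into context-rich text by prompting an LLM. The helper below illustrates that step; the prompt wording and the call_llm stub are placeholders for whatever LLM client and template a reader prefers, not the paper's template.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for any instruction-following LLM API.
    raise NotImplementedError("plug in your preferred LLM client here")

def contextualize_triplet(head: str, relation: str, tail: str) -> str:
    # Turn a compact structural triplet into a context-rich paragraph that a
    # discriminative or generative KGC model can then learn from.
    prompt = (
        "Rewrite the knowledge-graph triplet below as a short, fluent paragraph that "
        "describes both entities and explains why the relation holds.\n"
        f"Triplet: ({head}, {relation}, {tail})\nParagraph:"
    )
    return call_llm(prompt)

# Example: contextualize_triplet("Marie Curie", "award_received", "Nobel Prize in Physics")
```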
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
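One way to write the two extra terms suggested by the entry above: a ranking term that makes each image prefer its true caption over a hard negative by a margin, and an intra-modal term that keeps a caption and its perturbed hard negative apart inside the text space. Both forms are assumptions inferred from the title and summary, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def cross_modal_ranking(img, txt, neg_txt, margin=0.2):
    # Each image should score its true caption above the hard-negative caption by a margin.
    img, txt, neg_txt = (F.normalize(x, dim=-1) for x in (img, txt, neg_txt))
    pos = (img * txt).sum(-1)
    neg = (img * neg_txt).sum(-1)
    return F.relu(margin - pos + neg).mean()

def intra_modal_separation(txt, neg_txt, margin=0.2):
    # Within the text modality, penalize a caption and its hard negative for being
    # nearly identical, so the text encoder itself distinguishes them.
    txt, neg_txt = F.normalize(txt, dim=-1), F.normalize(neg_txt, dim=-1)
    return F.relu((txt * neg_txt).sum(-1) - (1.0 - margin)).mean()

# These terms would be added on top of the standard CLIP image-text contrastive loss.
```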
- Understanding Zero-Shot Adversarial Robustness for Large-Scale Models [31.295249927085475]
We identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness.
We propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning.
Our approach significantly improves the zero-shot adversarial robustness over CLIP, yielding an average improvement of over 31 points across ImageNet and 15 zero-shot datasets.
arXiv Detail & Related papers (2022-12-14T04:08:56Z)
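A compact sketch of text-guided contrastive adversarial training as the summary above describes it: craft a perturbation that breaks the alignment between image features and class-prompt text embeddings, then train on the loss that realigns them. The PGD-style inner loop, step sizes, and the CLIP-style encode_image method are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def image_text_ce(img_emb, txt_emb, labels, tau=0.07):
    # Cross-entropy over image-to-text similarities; txt_emb holds one embedding per class prompt.
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / tau
    return F.cross_entropy(logits, labels)

def text_guided_adv_loss(model, images, txt_emb, labels, eps=1/255, alpha=1/255, steps=3):
    # Inner loop: find a bounded perturbation that maximizes the alignment loss.
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = image_text_ce(model.encode_image(images + delta), txt_emb, labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Outer objective: realign adversarial visual features with the text embeddings.
    return image_text_ce(model.encode_image(images + delta), txt_emb, labels)
```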
- Unifying Graph Contrastive Learning with Flexible Contextual Scopes [57.86762576319638]
We present a self-supervised learning method termed Unifying Graph Contrastive Learning with Flexible Contextual Scopes (UGCL for short).
Our algorithm builds flexible contextual representations with contextual scopes by controlling the power of an adjacency matrix.
Based on representations from both local and contextual scopes, UGCL optimises a very simple contrastive loss function for graph representation learning.
arXiv Detail & Related papers (2022-10-17T07:16:17Z)
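The UGCL entry above controls the "contextual scope" through powers of the adjacency matrix and contrasts local against contextual node representations. The dense sketch below mirrors that description (a sparse implementation would be used in practice); the normalization, scope size, and loss form are assumptions rather than the paper's exact components.

```python
import torch
import torch.nn.functional as F

def normalized_adjacency(adj):
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
    a = adj + torch.eye(adj.size(0), device=adj.device)
    d_inv_sqrt = torch.diag(a.sum(1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt

def contextual_representation(x, adj_norm, k):
    # Larger k = wider contextual scope: propagate node features over k hops.
    for _ in range(k):
        x = adj_norm @ x
    return x

def node_contrastive_loss(local, contextual, tau=0.5):
    # Each node's local view is pulled toward its own contextual view and pushed
    # away from every other node's (a very simple InfoNCE over nodes).
    z1, z2 = F.normalize(local, dim=-1), F.normalize(contextual, dim=-1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```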
- ExCon: Explanation-driven Supervised Contrastive Learning for Image Classification [12.109442912963969]
We propose to leverage saliency-based explanation methods to create content-preserving masked augmentations for contrastive learning.
Our novel explanation-driven supervised contrastive learning (ExCon) methodology critically serves the dual goals of encouraging nearby image embeddings to have similar content and explanation.
We demonstrate that ExCon outperforms vanilla supervised contrastive learning in terms of classification, explanation quality, adversarial robustness as well as calibration of probabilistic predictions of the model in the context of distributional shift.
arXiv Detail & Related papers (2021-11-28T23:15:26Z)
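The ExCon entry above builds content-preserving masked augmentations from saliency explanations. The sketch below uses plain gradient saliency as a stand-in for the explanation method and keeps only the most class-relevant pixels; the saliency choice and keep ratio are assumptions. The masked view and the original image would then be treated as a positive pair in a supervised contrastive loss.

```python
import torch

def saliency_masked_view(model, images, labels, keep_ratio=0.5):
    # images: (B, 3, H, W); labels: (B,). Gradient saliency w.r.t. the true-class logit.
    images = images.clone().requires_grad_(True)
    score = model(images).gather(1, labels[:, None]).sum()
    grad, = torch.autograd.grad(score, images)
    sal = grad.abs().amax(dim=1, keepdim=True)                    # (B, 1, H, W)
    thresh = sal.flatten(1).quantile(1.0 - keep_ratio, dim=1)     # per-image cutoff
    mask = (sal >= thresh[:, None, None, None]).float()
    # Zero out low-saliency pixels so the augmented view keeps the explanatory content.
    return images.detach() * mask
```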
- FREE: Feature Refinement for Generalized Zero-Shot Learning [86.41074134041394]
Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of visual-semantic domain gap and seen-unseen bias.
Most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the cross-dataset bias between ImageNet and GZSL benchmarks.
We propose a simple yet effective GZSL method, termed feature refinement for generalized zero-shot learning (FREE) to tackle the above problem.
arXiv Detail & Related papers (2021-07-29T08:11:01Z)