ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
- URL: http://arxiv.org/abs/2512.17178v1
- Date: Fri, 19 Dec 2025 02:36:51 GMT
- Title: ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
- Authors: Qi Zhang, Yuxu Chen, Lei Deng, Lili Shen
- Abstract summary: ABE-CLIP is a training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. We employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity.
- Score: 9.610261779024219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
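The abstract names two components: semantic refinement of attribute/object token embeddings and a Local Token-Patch Alignment step. The following is a minimal, hypothetical sketch of the alignment-and-aggregation idea only (function and variable names are ours, not the authors'; the paper's refinement mechanism is not reproduced here):

```python
# Hedged sketch of local token-patch alignment: each refined text token is
# matched to its most similar image patch, and the localized scores are
# aggregated into one image-text similarity. Hypothetical names throughout.
import torch
import torch.nn.functional as F

def token_patch_similarity(token_emb: torch.Tensor,
                           patch_emb: torch.Tensor) -> torch.Tensor:
    """token_emb: (T, d) refined embeddings of attribute/object tokens.
    patch_emb: (P, d) patch embeddings from the image encoder.
    Returns a scalar image-text similarity."""
    t = F.normalize(token_emb, dim=-1)
    p = F.normalize(patch_emb, dim=-1)
    sim = t @ p.T                      # (T, P) token-to-patch cosine similarities
    best = sim.max(dim=-1).values      # each token's most relevant patch
    return best.mean()                 # aggregate the localized scores

# Toy usage: 4 refined tokens against 49 ViT patches, 512-d embeddings.
score = token_patch_similarity(torch.randn(4, 512), torch.randn(49, 512))
```

Averaging over tokens is one plausible aggregation; the paper may weight tokens differently.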
Related papers
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization [82.31106470150844]
We introduce Omni-Attribute, the first open-vocabulary image attribute encoder to learn attribute-specific representations. We use a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation.
arXiv Detail & Related papers (2025-12-11T18:59:56Z)
- What Makes You Unique? Attribute Prompt Composition for Object Re-Identification [70.67907354506278]
Object Re-IDentification aims to recognize individuals across non-overlapping camera views. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies. We propose an Attribute Prompt Composition framework, which exploits textual semantics to jointly enhance discrimination and generalization.
arXiv Detail & Related papers (2025-09-23T07:03:08Z)
- VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that are fused with the text embeddings to enrich the representation. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better human-evaluated image quality and greater robustness as the number of binding pairs in the prompt grows.
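The prototype-fusion step can be pictured with a short sketch. This is a hedged, hypothetical rendering (the names and the convex-blend form are our assumptions; VSC's pipeline also includes the sub-prompt image generation, omitted here):

```python
# Illustrative-only sketch of fusing visual prototypes with a text embedding.
import torch
import torch.nn.functional as F

def fuse_prototypes(text_emb: torch.Tensor,
                    proto_embs: torch.Tensor,
                    alpha: float = 0.3) -> torch.Tensor:
    """text_emb: (d,) embedding of the full prompt.
    proto_embs: (k, d) image embeddings of the per-sub-prompt prototypes.
    Returns a fused conditioning embedding (blend weight alpha is assumed)."""
    proto = F.normalize(proto_embs.mean(dim=0), dim=-1)  # average prototype
    fused = (1.0 - alpha) * F.normalize(text_emb, dim=-1) + alpha * proto
    return F.normalize(fused, dim=-1)
```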
arXiv Detail & Related papers (2025-05-02T08:31:43Z)
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [78.73711446918814]
We propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to fully exploit attribute-based text knowledge and improve AG-ReID performance.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
- Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training [17.893694262999826]
This paper introduces Understanding and Linking Attributes and Objects (ULAO), a novel framework for Compositional Zero-Shot Learning (CZSL) comprising two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding through sequential primitive prediction, leveraging recognized objects as contextual hints for attribute classification. The Linking Attributes and Objects (LAO) module improves attribute-object linkage through a new contrastive learning strategy that incorporates tailored hard-negative generation and adaptive loss adjustments; a toy sketch of such a strategy follows.
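As a hedged illustration of contrastive training with tailored hard negatives (not the ULAO code; the margin form and names are our assumptions), a mismatched attribute-object pair, e.g. one with a swapped attribute, can serve as the hard negative:

```python
# Toy margin-ranking loss for attribute-object linkage with hard negatives.
import torch
import torch.nn.functional as F

def linkage_loss(img: torch.Tensor,   # (B, d) image embeddings
                 pos: torch.Tensor,   # (B, d) embeddings of correct (attr, obj) pairs
                 neg: torch.Tensor,   # (B, d) hard negatives, e.g. swapped attributes
                 margin: float = 0.2) -> torch.Tensor:
    s_pos = F.cosine_similarity(img, pos)          # similarity to correct pair
    s_neg = F.cosine_similarity(img, neg)          # similarity to hard negative
    return F.relu(margin + s_neg - s_pos).mean()   # push positives above negatives
```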
arXiv Detail & Related papers (2024-12-10T03:41:20Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of the differences between two images correspond to the difference of their image embeddings. Our approach yields significantly improved capabilities in ranking images by a given attribute and improved zero-shot classification performance on many downstream image classification tasks.
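The stated objective, aligning difference texts with embedding differences, admits a compact contrastive sketch (hedged: the loss form, temperature, and names are our assumptions, not the paper's exact recipe):

```python
# Toy contrastive loss matching image-embedding differences to "difference" texts.
import torch
import torch.nn.functional as F

def difference_loss(img_a: torch.Tensor,      # (B, d) embeddings of first images
                    img_b: torch.Tensor,      # (B, d) embeddings of second images
                    diff_text: torch.Tensor,  # (B, d) embeddings of difference captions
                    tau: float = 0.07) -> torch.Tensor:
    diff = F.normalize(img_a - img_b, dim=-1)      # direction of change in image space
    text = F.normalize(diff_text, dim=-1)
    logits = diff @ text.T / tau                   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)        # match each pair to its own caption
```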
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search [19.610244285078483]
We propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images.
We show that our proposed method significantly surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2024-06-06T03:34:42Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames, making full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
- Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to recognize human attributes from video frames, making full use of temporal information.
We formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained large model CLIP to extract feature embeddings of the given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z)