From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
- URL: http://arxiv.org/abs/2505.13233v1
- Date: Mon, 19 May 2025 15:15:37 GMT
- Title: From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
- Authors: Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
- Abstract summary: ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. ABS is training-free and even rivals few-shot and test-time adaptation methods.
- Score: 38.98491521357191
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in aligning with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method that moves from local details to global context: it applies attention-guided cropping in both the raw image and the feature space, and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.
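The abstract names two ingredients: attention-guided cropping (on the raw image and in feature space) and soft matching to filter LLM-generated class descriptions. The sketch below is only an illustration of how these two ideas could be wired together on top of OpenAI's open-source `clip` package; the attention-extraction helper `get_cls_attention`, the cropping heuristic, and the temperature value are assumptions made here for clarity, not the implementation released in the repository above.

```python
# Illustrative sketch: attention-guided cropping + soft description matching with CLIP.
# `get_cls_attention` (e.g., a forward hook on the last ViT block) is hypothetical.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def attention_crop(pil_img, attn_map, keep_ratio=0.6):
    """Crop the image around the highest-attention patches.

    attn_map: (H_p, W_p) CLS-to-patch attention from the visual encoder.
    """
    attn = attn_map.cpu().numpy()
    h_p, w_p = attn.shape
    W, H = pil_img.size
    # Keep the top `keep_ratio` of attention mass and take its bounding box.
    thresh = np.quantile(attn, 1.0 - keep_ratio)
    ys, xs = np.where(attn >= thresh)
    x0, x1 = xs.min() / w_p * W, (xs.max() + 1) / w_p * W
    y0, y1 = ys.min() / h_p * H, (ys.max() + 1) / h_p * H
    return pil_img.crop((int(x0), int(y0), int(x1), int(y1)))

@torch.no_grad()
def soft_match_logits(image_feats, class_descriptions, temperature=0.01):
    """Weight each LLM description by its similarity instead of a hard top-1 pick."""
    logits = []
    for descs in class_descriptions:               # one list of descriptions per class
        tokens = clip.tokenize(descs).to(device)
        t = model.encode_text(tokens).float()
        t = t / t.norm(dim=-1, keepdim=True)
        sims = image_feats @ t.T                   # (1, num_descriptions)
        weights = torch.softmax(sims / temperature, dim=-1)
        logits.append((weights * sims).sum(dim=-1))  # soft-matched class score
    return torch.stack(logits, dim=-1)             # (1, num_classes)

# Usage (schematic): average features from the full image and the attention-guided crop.
# img = Image.open("dog.jpg")
# attn = get_cls_attention(model, preprocess(img))          # hypothetical helper
# views = [img, attention_crop(img, attn)]
# feats = torch.cat([model.encode_image(preprocess(v).unsqueeze(0).to(device)) for v in views]).float()
# feats = feats / feats.norm(dim=-1, keepdim=True)
# scores = soft_match_logits(feats.mean(0, keepdim=True), descriptions_per_class)
```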
Related papers
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP) has emerged as a pivotal model in computer vision and multimodal learning. CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. We introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z) - Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model [56.573203512455706]
Large-scale vision-language models (VLMs) have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, their predictions are difficult to interpret; one approach to address this issue is to develop interpretable models by integrating language. We propose LaZSL, a locally-aligned vision-language model for interpretable ZSL.
arXiv Detail & Related papers (2025-06-30T13:14:46Z) - Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning [23.96220607033524]
This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL).
In MLZSL, a model is trained to recognize multiple unseen classes within a sample based on seen classes and auxiliary knowledge.
We propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties.
arXiv Detail & Related papers (2024-08-22T09:45:24Z) - Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIP Surgery to a self-self attention path (a simplified illustration of this idea appears after this list).
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z) - GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning [24.075034737719776]
This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL).
We propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection.
Experiments on the large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods by large margins.
arXiv Detail & Related papers (2023-09-02T12:07:21Z) - Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL consists of a two-stream architecture by extracting two types of global features from RGB and noise views respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
arXiv Detail & Related papers (2022-10-16T13:30:13Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Semantically Grounded Visual Embeddings for Zero-Shot Learning [17.86691047421871]
We propose to learn semantically grounded and enriched visual information by computing a joint image and text model with a two-stream network on a proxy task.
Our method, dubbed joint embeddings for zero-shot learning, is evaluated on several benchmark datasets.
arXiv Detail & Related papers (2022-01-03T10:43:15Z) - Goal-Oriented Gaze Estimation for Zero-Shot Learning [62.52340838817908]
We introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization.
We aim to predict the actual human gaze location to get the visual attention regions for recognizing a novel object guided by attribute description.
This work implies the promising benefits of collecting human gaze datasets and of automatic gaze estimation algorithms for high-level computer vision tasks.
arXiv Detail & Related papers (2021-03-05T02:14:57Z) - Zero-Shot Learning from scratch (ZFS): leveraging local compositional representations [25.449244103599106]
Zero-shot classification is a generalization task where no instance from the target classes is seen during training.
To allow for test-time transfer, each class is annotated with semantic information, commonly in the form of attributes or text descriptions.
The approaches that achieve the best absolute performance on image benchmarks rely on features extracted from encoders pretrained on ImageNet.
We propose Zero-Shot Learning from scratch (ZFS), which explicitly forbids the use of encoders fine-tuned on other datasets.
arXiv Detail & Related papers (2020-10-22T23:11:18Z) - Attribute Prototype Network for Zero-Shot Learning [113.50220968583353]
We propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features.
Our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation.
arXiv Detail & Related papers (2020-08-19T06:46:35Z) - Simple and effective localized attribute representations for zero-shot learning [48.053204004771665]
Zero-shot learning (ZSL) aims to discriminate images from unseen classes by exploiting relations to seen classes via their semantic descriptions.
We propose localizing representations in the semantic/attribute space, with a simple but effective pipeline where localization is implicit.
Our method can be implemented easily and can serve as a new baseline for zero-shot learning.
arXiv Detail & Related papers (2020-06-10T16:46:12Z)
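The Grounding Everything entry above mentions generalizing value-value attention to a self-self attention path. As a rough, simplified illustration of that idea under assumptions made here (not the authors' implementation), one can compute token-token similarities within each projection (query-query, key-key, value-value) and use them to aggregate the values:

```python
# Simplified sketch of a "self-self" attention path: similarities are computed
# between a projection and itself rather than between queries and keys, and the
# resulting weights aggregate the values. Illustrative only.
import torch
import torch.nn.functional as F

def self_self_attention(q, k, v, temperature=1.0):
    """q, k, v: (num_tokens, dim) projections taken from one transformer block."""
    outputs = []
    for proj in (q, k, v):                        # one self-self path per projection
        proj_n = F.normalize(proj, dim=-1)        # cosine-style similarities
        attn = torch.softmax(proj_n @ proj_n.T / temperature, dim=-1)
        outputs.append(attn @ v)                  # aggregate values with self-similarity
    return torch.stack(outputs).mean(0)           # average the three paths

# Toy usage with random projections of 197 ViT tokens (1 CLS + 196 patches).
q, k, v = (torch.randn(197, 768) for _ in range(3))
patch_features = self_self_attention(q, k, v)
```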