AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
- URL: http://arxiv.org/abs/2508.03201v1
- Date: Tue, 05 Aug 2025 08:16:35 GMT
- Title: AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
- Authors: Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
- Abstract summary: Weakly supervised visual grounding aims to locate objects in images based on text descriptions. Existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions. We introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG.
- Score: 51.74170851840497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
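For intuition, here is a minimal PyTorch sketch of the coarse-to-fine query filtering the abstract describes. The function name, dimensions, top-k pruning rule, and max-pooled word-level aggregation are our assumptions, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def align_cat_filter(queries, cat_emb, word_embs, keep_ratio=0.5):
    """Hypothetical coarse-to-fine query filtering in the spirit of AlignCAT.

    queries:   (N, D) visual query features from a detection-style decoder
    cat_emb:   (D,)   embedding of the category word (e.g., "dog")
    word_embs: (W, D) word-level embeddings of the full expression
    Returns the index of the query judged best aligned with the expression.
    """
    q = F.normalize(queries, dim=-1)

    # Coarse stage: keep only queries consistent with the category word,
    # suppressing category-inconsistent objects.
    coarse = q @ F.normalize(cat_emb, dim=-1)             # (N,)
    kept = coarse.topk(max(1, int(keep_ratio * len(q)))).indices

    # Fine stage: score survivors against every word of the expression
    # (attribute cues), taking each query's best word-level match.
    w = F.normalize(word_embs, dim=-1)
    fine = (q[kept] @ w.t()).max(dim=-1).values           # (n_keep,)
    return kept[fine.argmax()]

# Toy usage with random features (D = 256).
torch.manual_seed(0)
print(align_cat_filter(torch.randn(32, 256), torch.randn(256), torch.randn(7, 256)))
```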
Related papers
- Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval [23.472806734625774]
We propose Dual Prompt Learning with Joint Category-Attribute Reweighting (DCAR) to achieve precise image-text matching. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning.
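A hedged reading of the "joint category-attribute reweighting" idea: blend class-level and attribute-level image-text similarities into a single matching score. All names and the fixed weight `alpha` below are illustrative, not DCAR's actual formulation.

```python
import torch
import torch.nn.functional as F

def reweighted_match(img_feat, class_prompt, attr_prompts, alpha=0.6):
    """Illustrative category-attribute reweighted image-text score.

    img_feat:     (D,)   image embedding
    class_prompt: (D,)   class-prompt text embedding
    attr_prompts: (A, D) attribute-prompt text embeddings
    alpha balances the class term against the averaged attribute term.
    """
    i = F.normalize(img_feat, dim=-1)
    cls_sim = i @ F.normalize(class_prompt, dim=-1)                 # scalar
    attr_sim = (i @ F.normalize(attr_prompts, dim=-1).t()).mean()   # scalar
    return alpha * cls_sim + (1 - alpha) * attr_sim
```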
arXiv Detail & Related papers (2025-08-06T02:44:08Z)
- Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms. Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible. We propose Enriched-FineR, a training-free method that demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z)
- SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
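One plausible instantiation of "dynamically synthesize the visual prompts for unseen categories" is to mix the prompts learned for training categories, weighted by the semantic similarity of category-name text embeddings. The sketch below is an assumption-laden illustration, not SDVPT's code.

```python
import torch
import torch.nn.functional as F

def synthesize_prompt(unseen_text_emb, train_text_embs, train_prompts, tau=0.07):
    """Hypothetical visual-prompt synthesis for an unseen category.

    unseen_text_emb: (D,)      text embedding of the unseen category name
    train_text_embs: (K, D)    text embeddings of the K training categories
    train_prompts:   (K, P, D) visual prompts learned per training category
    Returns a (P, D) prompt: a similarity-weighted mixture over categories.
    """
    sim = F.normalize(train_text_embs, dim=-1) @ F.normalize(unseen_text_emb, dim=-1)
    w = F.softmax(sim / tau, dim=0)                       # (K,) semantic correlation
    return torch.einsum("k,kpd->pd", w, train_prompts)    # synthesized prompt
```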
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
- Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision. We introduce a semantic decoupling module and a category-specific prompt optimization method into a CLIP-based framework. Our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline.
arXiv Detail & Related papers (2024-12-14T14:31:36Z)
- Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework to explore semantic correlations. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z)
- Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, an OVSGG framework based on scene-specific descriptions.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called the mutual visual adapter.
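The summary gives no internals for the mutual visual adapter; the sketch below is only our guess at a lightweight "mutual" design, with subject and object token sets cross-attending to each other.

```python
import torch
import torch.nn as nn

class MutualVisualAdapter(nn.Module):
    """Guessed lightweight adapter: subject and object token sets attend to
    each other with residual updates. Depth and sizes are illustrative."""

    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.s2o = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.o2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, subj, obj):                  # each (B, N, dim)
        subj_upd, _ = self.s2o(subj, obj, obj)     # subject attends to object
        obj_upd, _ = self.o2s(obj, subj, subj)     # object attends to subject
        return subj + subj_upd, obj + obj_upd
```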
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
- Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification [8.139529179222844]
Category-Prompt Refined Feature Learning (CPRFL) is a novel approach for long-tailed multi-label image classification (LTMLC).
CPRFL initializes category prompts from pretrained CLIP embeddings and decouples category-specific visual representations.
We validate the effectiveness of our method on two LTMLC benchmarks; extensive experiments demonstrate its superiority over baselines.
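A minimal sketch of the two stated ingredients, prompts initialized from pretrained text embeddings and category-specific visual decoupling; the single cross-attention layer and sizes are our assumptions, not CPRFL's architecture.

```python
import torch
import torch.nn as nn

class CategoryPromptDecoupler(nn.Module):
    """Illustrative module: category prompts, initialized from pretrained
    class-name text embeddings, attend over spatial visual tokens to pull
    out one feature per category."""

    def __init__(self, text_embs, heads=8):        # text_embs: (C, dim)
        super().__init__()
        # Prompt initialization from CLIP-like class-name embeddings.
        self.prompts = nn.Parameter(text_embs.clone())
        self.attn = nn.MultiheadAttention(text_embs.size(1), heads, batch_first=True)

    def forward(self, visual_tokens):              # (B, N, dim)
        b = visual_tokens.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)      # (B, C, dim)
        out, _ = self.attn(q, visual_tokens, visual_tokens)  # (B, C, dim)
        return out  # one decoupled visual representation per category
```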
arXiv Detail & Related papers (2024-08-15T12:51:57Z)
- Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions [35.20091752343433]
This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary.
The two contexts hierarchically construct a precise description for a given category: a sample is first roughly classified into its predicted category.
These precise category descriptions within the vision-language framework enable a novel application: CATegory-EXtensible OOD detection (CATEX).
arXiv Detail & Related papers (2024-07-23T12:53:38Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a challenging task of great practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
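DualCoOp-style methods pair a positive ("present") and a negative ("absent") learnable prompt per class; a hedged sketch of that scoring rule follows. The evidence-guided refinement that distinguishes DualCoOp++ is not modeled here.

```python
import torch
import torch.nn.functional as F

def dual_prompt_probs(img_feat, pos_prompts, neg_prompts, scale=100.0):
    """Illustrative dual-context scoring: compare an image against a
    per-class "present" prompt and "absent" prompt.

    img_feat:    (D,)    image (or region) embedding
    pos_prompts: (C, D)  per-class positive prompt embeddings
    neg_prompts: (C, D)  per-class negative prompt embeddings
    Returns (C,) probabilities that each label is present.
    """
    i = F.normalize(img_feat, dim=-1)
    s_pos = scale * (F.normalize(pos_prompts, dim=-1) @ i)  # (C,)
    s_neg = scale * (F.normalize(neg_prompts, dim=-1) @ i)  # (C,)
    return F.softmax(torch.stack([s_pos, s_neg], dim=-1), dim=-1)[..., 0]
```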
arXiv Detail & Related papers (2023-08-03T17:33:20Z)