AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
- URL: http://arxiv.org/abs/2508.03201v3
- Date: Mon, 27 Oct 2025 15:43:20 GMT
- Title: AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
- Authors: Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
- Abstract summary: Weakly supervised visual grounding aims to locate objects in images based on text descriptions. Existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions. We introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG.
- Score: 56.972490764212175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
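The abstract describes a coarse-to-fine pipeline: a category-level filter first discards category-inconsistent visual queries, and an attribute-level matcher then ranks the survivors. The sketch below is a minimal, hypothetical illustration of that progressive filtering idea, not the authors' implementation; the function name `align_queries`, the cosine scoring, and the threshold value are all assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def align_queries(query_feats, category_feat, attribute_feat, cat_thresh=0.5):
    """Hypothetical two-stage matching: a coarse category filter followed by
    fine-grained attribute scoring over the surviving queries.

    Returns the index of the best-aligned query, or None if every query is
    rejected at the coarse stage.
    """
    # Coarse stage: drop queries whose category-level similarity is too low,
    # mitigating interference from category-inconsistent objects.
    kept = [i for i, q in enumerate(query_feats)
            if cosine(q, category_feat) >= cat_thresh]
    if not kept:
        return None
    # Fine stage: rank the remaining queries by attribute-level similarity.
    return max(kept, key=lambda i: cosine(query_feats[i], attribute_feat))
```

In a contrastive-learning setting, restricting the loss to the queries that survive the coarse stage is what would make training more efficient: mismatched negatives are removed before the fine-grained comparison.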
Related papers
- State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection [23.788375360674063]
Existing semantic prototypes fail to capture the rich intra-class visual variations induced by different object states. Standard pseudo-box generation introduces a semantic mismatch between visual region proposals and object-centric text embeddings. We introduce State-Enhanced Semantic Prototypes (SESP) and Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch.
arXiv Detail & Related papers (2025-11-22T10:25:19Z) - AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations. We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features. Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
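The self-regularization idea above can be sketched as a penalty that keeps learned (prompted) text features from drifting away from the frozen, non-prompted ones. This is a hypothetical illustration under the assumption of a simple squared-distance constraint; the function name and the exact loss form are not taken from the paper.

```python
import numpy as np

def self_regularization_loss(prompted_feats, frozen_feats):
    """Hypothetical self-regularization term: mean squared distance between
    prompted text features and the frozen (non-prompted) text features,
    discouraging the learned prompts from distorting pre-trained semantics."""
    diff = np.asarray(prompted_feats) - np.asarray(frozen_feats)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

In training, such a term would typically be added to the task loss with a small weighting coefficient, trading adaptation against preservation of the pre-trained text space.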
arXiv Detail & Related papers (2025-09-07T07:07:59Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - SemPT: Semantic Prompt Tuning for Vision-Language Models [46.02674444180396]
Vision-Language Models pre-trained on large amounts of image-text pairs offer a promising solution. We introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge. SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.
arXiv Detail & Related papers (2025-08-14T13:41:59Z) - Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval [23.472806734625774]
We propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR) to achieve precise image-text matching. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning.
arXiv Detail & Related papers (2025-08-06T02:44:08Z) - Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms. Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible. We propose our training-free method, Enriched-FineR, which demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z) - SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z) - Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision. We introduce a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework. Our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z) - Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$2$SRT) framework to explore the semantic correlation. The proposed framework consists of two complementary modules, i.e., the intra-category semantic refinement (ISR) module and the inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z) - Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific description based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z) - Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification [8.139529179222844]
Category-Prompt Refined Feature Learning (CPRFL) is a novel approach for Long-Tailed Multi-Label image Classification.
CPRFL initializes category-prompts from the pretrained CLIP's embeddings and decouples category-specific visual representations.
We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines.
arXiv Detail & Related papers (2024-08-15T12:51:57Z) - Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions [35.20091752343433]
This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary.
The two contexts hierarchically construct the precise description for a certain category, which is first roughly classifying a sample to the predicted category.
The precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX).
arXiv Detail & Related papers (2024-07-23T12:53:38Z) - DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++)
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.