State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2511.18012v1
- Date: Sat, 22 Nov 2025 10:25:19 GMT
- Title: State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection
- Authors: Jiaying Zhou, Qingchao Chen
- Abstract summary: Existing semantic prototypes fail to capture the rich intra-class visual variations induced by different object states. Standard pseudo-box generation introduces a semantic mismatch between visual region proposals and object-centric text embeddings. We introduce State-Enhanced Semantic Prototypes (SESP) and Scene-Augmented Pseudo Prototypes (SAPP) to address these two issues.
- Score: 23.788375360674063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-Vocabulary Object Detection (OVOD) aims to generalize object recognition to novel categories, while Weakly Supervised OVOD (WS-OVOD) extends this by combining box-level annotations with image-level labels. Despite recent progress, two critical challenges persist in this setting. First, existing semantic prototypes, even when enriched by LLMs, are static and limited, failing to capture the rich intra-class visual variations induced by different object states (e.g., a cat's pose). Second, the standard pseudo-box generation introduces a semantic mismatch between visual region proposals (which contain context) and object-centric text embeddings. To tackle these issues, we introduce two complementary prototype enhancement strategies. To capture intra-class variations in appearance and state, we propose the State-Enhanced Semantic Prototypes (SESP), which generates state-aware textual descriptions (e.g., "a sleeping cat") to capture diverse object appearances, yielding more discriminative prototypes. Building on this, we further introduce Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch. SAPP incorporates contextual semantics (e.g., "cat lying on sofa") and utilizes a soft alignment mechanism to promote contextually consistent visual-textual representations. By integrating SESP and SAPP, our method effectively enhances both the richness of semantic prototypes and the visual-textual alignment, achieving notable improvements.
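To make the two ideas concrete, the sketch below builds a state-enhanced prototype by averaging CLIP text embeddings of several state-aware descriptions, and shows a soft alignment loss that supervises region-to-prototype similarities with a soft target distribution rather than a hard one-hot match. This is a minimal illustration under stated assumptions: the prompt templates, the HuggingFace CLIP checkpoint, and the exact loss form are illustrative choices, not the paper's released implementation.
```python
# Minimal sketch of SESP/SAPP-style prototypes, under stated assumptions:
# prompt templates, CLIP checkpoint, and loss form are illustrative only.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_prototype(prompts: list[str]) -> torch.Tensor:
    """Embed a set of descriptions and average them into one
    L2-normalized prototype vector."""
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    emb = model.get_text_features(**inputs)        # (num_prompts, dim)
    emb = F.normalize(emb, dim=-1)                 # normalize each description
    return F.normalize(emb.mean(dim=0), dim=-1)    # aggregate and renormalize

# SESP-style: state-aware descriptions of the object ("a sleeping cat", ...)
sesp_cat = text_prototype([f"a photo of a {s} cat"
                           for s in ("sleeping", "jumping", "sitting")])

# SAPP-style: scene-augmented descriptions that keep context ("cat lying on sofa")
sapp_cat = text_prototype(["a cat lying on a sofa",
                           "a cat on a windowsill",
                           "a cat in a garden"])

def soft_alignment_loss(region_feats: torch.Tensor,
                        protos: torch.Tensor,
                        soft_targets: torch.Tensor,
                        tau: float = 0.07) -> torch.Tensor:
    """Soft alignment: supervise region-to-prototype similarities with a
    soft target distribution (e.g., derived from image-level labels)
    instead of a hard one-hot assignment."""
    sims = F.normalize(region_feats, dim=-1) @ F.normalize(protos, dim=-1).T
    return F.cross_entropy(sims / tau, soft_targets)  # soft-label cross-entropy
```
Averaging normalized embeddings of multiple state- or scene-aware prompts is a standard prompt-ensembling choice; the soft targets let contextual region proposals align with contextually consistent text rather than being forced onto a single object-centric embedding.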
Related papers
- Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding [11.244257545057508]
Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information. The framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
arXiv Detail & Related papers (2025-09-08T02:27:10Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding [56.972490764212175]
Weakly supervised visual grounding aims to locate objects in images based on text descriptions. Existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions. We introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG.
arXiv Detail & Related papers (2025-08-05T08:16:35Z) - Semantic-Space-Intervened Diffusive Alignment for Visual Classification [11.621655970763467]
Cross-modal alignment is an effective approach to improving visual classification. This paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods.
arXiv Detail & Related papers (2025-05-09T01:41:23Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary techniques, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Context Disentangling and Prototype Inheriting for Robust Visual Grounding [56.63007386345772]
Visual grounding (VG) aims to locate a specific target in an image based on a given language query.
We propose a novel framework with context disentangling and prototype inheriting for robust visual grounding that handles both scenarios.
Our method outperforms the state-of-the-art methods in both scenarios.
arXiv Detail & Related papers (2023-12-19T09:03:53Z) - CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robust performance on open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z) - Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic alignment model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more practical setting: generating a realistic image from given objects and captions.
Under this setting, objects explicitly define the critical roles in the target images, while captions implicitly describe their rich attributes and connections.
MOC-GAN is proposed to mix inputs from the two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)