Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
- URL: http://arxiv.org/abs/2509.06291v1
- Date: Mon, 08 Sep 2025 02:27:10 GMT
- Title: Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
- Authors: Jiangnan Xie, Xiaolong Zheng, Liang Zheng
- Abstract summary: Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information. Our framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
- Score: 11.244257545057508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios containing both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components. First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.
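To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the stages named in the abstract: ALBEF-style aligned visual/text encoding, discriminative gating of visual tokens, multi-neighbor semantic prototype aggregation, a multi-stage decoder, and bounding box regression. All module names, shapes, and the top-k cosine aggregation here are illustrative assumptions rather than the authors' released implementation; refer to the linked repository for the actual code.

```python
# Hypothetical sketch of a PAML-style grounding pipeline (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeAggregator(nn.Module):
    """Stand-in for the prototype discovering/inheriting mechanism:
    each feature inherits semantics from its k nearest learned prototypes."""

    def __init__(self, dim: int, num_prototypes: int = 256, k: int = 5):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D); cosine similarity to every prototype -> (B, N, P)
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
        topk_sim, topk_idx = sim.topk(self.k, dim=-1)          # (B, N, k)
        weights = topk_sim.softmax(dim=-1).unsqueeze(-1)        # (B, N, k, 1)
        neighbors = self.prototypes[topk_idx]                   # (B, N, k, D)
        # Add the similarity-weighted neighbor mean back onto the features
        return feats + (weights * neighbors).sum(dim=-2)


class PAMLSketch(nn.Module):
    def __init__(self, dim: int = 256, stages: int = 3):
        super().__init__()
        # Stand-ins for ALBEF-aligned visual and text encoders
        self.visual_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # "Visual Discriminative Feature Encoder" approximated by a saliency gate
        self.saliency_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.prototype_module = PrototypeAggregator(dim)
        # Multi-stage decoder approximated by stacked cross-attention layers
        self.decoder = nn.ModuleList(
            [nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(stages)]
        )
        self.bbox_head = nn.Linear(dim, 4)  # (cx, cy, w, h) regression

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor):
        v = self.visual_proj(visual_feats)       # (B, Nv, D)
        t = self.text_proj(text_feats)           # (B, Nt, D)
        v = v * self.saliency_gate(v)            # suppress irrelevant visual context
        v = self.prototype_module(v)             # enrich with semantic prototypes
        q = t.mean(dim=1, keepdim=True)          # pooled language query token
        for layer in self.decoder:               # multi-stage cross-modal fusion
            q = q + layer(q, v, v)[0]
        return self.bbox_head(q).sigmoid().squeeze(1)  # normalized box per image


if __name__ == "__main__":
    model = PAMLSketch()
    box = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
    print(box.shape)  # torch.Size([2, 4])
```

The sketch only mirrors the data flow; the actual model presumably uses full transformer encoders and a more elaborate prototype bank than the single learned parameter matrix assumed here.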
Related papers
- State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection [23.788375360674063]
Existing semantic prototypes fail to capture the rich intra-class visual variations induced by different object states. Standard pseudo-box generation introduces a semantic mismatch between visual region proposals and object-centric text embeddings. We introduce State-Enhanced Semantic Prototypes (SESP) and Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch.
arXiv Detail & Related papers (2025-11-22T10:25:19Z) - Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation [66.82598255715696]
Federated learning enables multiple medical institutions to train a global model without sharing data. Current approaches primarily focus on final-layer features, overlooking critical multi-level cues. We propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototype alignment.
arXiv Detail & Related papers (2025-11-14T04:15:34Z) - Text-guided Visual Prompt DINO for Generic Segmentation [31.33676182634522]
We propose Prompt-DINO, a text-guided visual Prompt DINO framework. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features. Second, we design order-aligned query selection for DETR-based architectures. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model.
arXiv Detail & Related papers (2025-08-08T09:09:30Z) - MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering [5.503514317063399]
Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances. We introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). We design a novel three-stage Multimodal Question Answering (MQA) pipeline to guide the MLLMs to precisely localize complex textual and visual targets.
arXiv Detail & Related papers (2025-02-23T07:59:39Z) - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z) - Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Multi-Modal Prototypes for Open-World Semantic Segmentation [37.84805778548119]
We propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for semantic segmentation.
We decompose the high-level language information as multi-aspect prototypes and aggregate the low-level visual information as more semantic prototypes.
Based on an elastic mask prediction module, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture.
arXiv Detail & Related papers (2023-07-05T03:27:31Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to specify novel categories: via language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.