PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
- URL: http://arxiv.org/abs/2509.04833v1
- Date: Fri, 05 Sep 2025 06:30:06 GMT
- Title: PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
- Authors: Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang
- Abstract summary: PropVG is an end-to-end proposal-based framework that seamlessly integrates foreground object proposal generation with referential object comprehension. We introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. We also design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets.
- Score: 23.54011217288122
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.
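The abstract describes CRS only at a high level: contrastive learning at the sentence and word levels is used to score how well each proposal matches the expression. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. The function names (`cosine`, `refer_scores`, `info_nce`), the blending weight `alpha`, the max-over-words pooling, and the temperature are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def refer_scores(proposals, sentence, words, alpha=0.5):
    """Score each proposal against the expression (hypothetical scheme).

    Blends a sentence-level similarity with a word-level similarity,
    where the word-level term is the best match over all word embeddings.
    """
    scores = []
    for p in proposals:
        sent_sim = cosine(p, sentence)
        word_sim = max(cosine(p, w) for w in words)
        scores.append(alpha * sent_sim + (1.0 - alpha) * word_sim)
    return scores

def info_nce(scores, positive_index, temperature=0.1):
    """InfoNCE-style loss: the referred proposal is the positive,
    all other proposals act as negatives."""
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_index]

# Toy 2-D embeddings (made up for illustration only).
proposals = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentence = [1.0, 0.1]
words = [[1.0, 0.0], [0.9, 0.2]]

scores = refer_scores(proposals, sentence, words)
loss = info_nce(scores, positive_index=0)
print(scores, loss)
```

In this toy setup the first proposal is aligned with both the sentence and word embeddings, so it receives the highest score and the loss is small when it is treated as the referred target.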
Related papers
- Reasoning-Driven Multimodal LLM for Domain Generalization [72.00754603114187]
We study the role of reasoning in domain generalization using the DomainBed-Reasoning dataset. We propose RD-MLDG, a framework with two components: MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization). Experiments on standard DomainBed datasets demonstrate that RD-MLDG achieves complementary state-of-the-art performances.
arXiv Detail & Related papers (2026-02-27T08:10:06Z)
- ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions. The proposed model is a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). The model implements a coarse-to
arXiv Detail & Related papers (2026-01-23T01:56:04Z) - From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking [24.903851740154433]
This paper introduces an explicit feature framework that enhances object discriminativeness across three perspectives.<n>Experiments demonstrate that SpongeBobA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT.
arXiv Detail & Related papers (2025-12-02T04:04:39Z) - A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues.<n>We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z) - Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection [19.807828545088082]
Weakly supervised object detection (WSOD) has attracted significant attention in recent years.<n>We present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation.<n>We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods.
arXiv Detail & Related papers (2025-09-10T05:12:03Z) - Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation [61.37076111486196]
Ref-AVS aims to segment target objects in audible videos based on given reference expressions.<n>We propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process.<n>Ref-Thinker is a multimodal language model capable of reasoning over textual, visual, and auditory cues.
arXiv Detail & Related papers (2025-08-06T13:05:09Z) - CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation [37.96005100341482]
Generalized Referring Expression (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios.
Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification.
We propose a textbfCounting-Aware textbfHierarchical textbfDecoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - PETDet: Proposal Enhancement for Two-Stage Fine-Grained Object Detection [26.843891792018447]
We present PETDet (Proposal Enhancement for Two-stage fine-grained object detection) to better handle the sub-tasks in two-stage FGOD methods.
An anchor-free Quality Oriented Proposal Network (QOPN) is proposed with dynamic label assignment and attention-based decomposition.
A novel Adaptive Recognition Loss (ARL) offers guidance for the R-CNN head to focus on high-quality proposals.
arXiv Detail & Related papers (2023-12-16T18:04:56Z) - Small Object Detection via Coarse-to-fine Proposal Generation and
Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object
Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z) - Contrastive Proposal Extension with LSTM Network for Weakly Supervised
Object Detection [52.86681130880647]
Weakly supervised object detection (WSOD) has attracted more and more attention since it only uses image-level labels and can save huge annotation costs.
We propose a new method by comparing the initial proposals and the extension ones to optimize those initial proposals.
Experiments on PASCAL VOC 2007, VOC 2012 and MS-COCO datasets show that our method has achieved the state-of-the-art results.
arXiv Detail & Related papers (2021-10-14T16:31:57Z)