Improving Generalized Visual Grounding with Instance-aware Joint Learning
- URL: http://arxiv.org/abs/2509.13747v1
- Date: Wed, 17 Sep 2025 07:00:51 GMT
- Title: Improving Generalized Visual Grounding with Instance-aware Joint Learning
- Authors: Ming Dai, Wenxuan Cheng, Jiang-Jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang
- Abstract summary: Generalized visual grounding tasks are designed to accommodate multi-target and non-target scenarios. We propose InstanceVG, a framework equipped with instance-aware capabilities to tackle both GREC and GRES. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching.
- Score: 45.53531162436934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing existing methods on various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.
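The abstract describes assigning each instance query a prior reference point that serves as an extra basis for target matching. The sketch below is a minimal, hypothetical illustration (not the authors' code) of how such a point could enter a Hungarian matching cost alongside a box term; the function name, cost weights, and box format are illustrative assumptions.

```python
# Schematic sketch: a prior reference point per instance query as an
# additional term in Hungarian target matching. Weights and names are
# illustrative assumptions, not the paper's actual formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_targets(query_boxes, query_points, target_boxes,
                             box_weight=1.0, point_weight=1.0):
    """Match N instance queries to M ground-truth targets.

    query_boxes:  (N, 4) predicted boxes as (cx, cy, w, h) in [0, 1]
    query_points: (N, 2) prior reference points as (x, y) in [0, 1]
    target_boxes: (M, 4) ground-truth boxes as (cx, cy, w, h) in [0, 1]
    """
    # L1 box cost between every query and every target.
    box_cost = np.abs(query_boxes[:, None, :] - target_boxes[None, :, :]).sum(-1)
    # Extra cost: distance from each query's prior point to each target center.
    point_cost = np.linalg.norm(
        query_points[:, None, :] - target_boxes[None, :, :2], axis=-1)
    cost = box_weight * box_cost + point_weight * point_cost
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))

queries = np.array([[0.2, 0.2, 0.1, 0.1], [0.8, 0.8, 0.2, 0.2]])
points  = np.array([[0.2, 0.2], [0.8, 0.8]])
targets = np.array([[0.82, 0.78, 0.2, 0.2], [0.21, 0.19, 0.1, 0.1]])
print(match_queries_to_targets(queries, points, targets))  # [(0, 1), (1, 0)]
```

In a real detector the cost would also include a classification term; the point here is only that a fixed reference point per query gives the matcher a stable spatial prior, which is what lets point, box, and mask predictions stay tied to the same instance.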
Related papers
- Cross-view Domain Generalization via Geometric Consistency for LiDAR Semantic Segmentation [12.10021698723751]
Domain-generalized LiDAR semantic segmentation (LSS) seeks to train models on source-domain point clouds that generalize reliably to multiple unseen target domains. Existing approaches assume similar acquisition views and struggle in cross-view scenarios. We formulate cross-view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC.
arXiv Detail & Related papers (2026-02-16T07:19:46Z) - Segment Any Events with Language [68.05185562243356]
We introduce SEAL, the first Semantic-aware Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. Our SEAL largely outperforms the proposed baselines in terms of performance and inference speed with a parameter-efficient architecture.
arXiv Detail & Related papers (2026-01-30T16:42:56Z) - CountZES: Counting via Zero-Shot Exemplar Selection [22.69910219820086]
We propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE).
arXiv Detail & Related papers (2025-12-18T11:12:50Z) - Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z) - Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension [46.07415235144545]
We address the challenging task of Generalized Referring Expression Comprehension (GREC). Existing REC methods face challenges in handling the complex cases encountered in GREC. We propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G).
arXiv Detail & Related papers (2025-01-02T18:57:59Z) - General and Task-Oriented Video Segmentation [60.58054218592606]
We present GvSeg, a general video segmentation framework for addressing four different video segmentation tasks.
GvSeg provides a holistic disentanglement and modeling for segment targets, thoroughly examining them from the perspective of appearance, position, and shape.
Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions.
arXiv Detail & Related papers (2024-07-09T04:21:38Z) - Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation [15.414518995812754]
Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances. We propose a unified, simple, yet effective framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment.
arXiv Detail & Related papers (2024-05-28T06:16:57Z) - CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation [37.96005100341482]
Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios.
Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification.
We propose a Counting-Aware Hierarchical Decoding framework (CoHD) for GRES.
arXiv Detail & Related papers (2024-05-24T15:53:59Z) - MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis [22.27724733876081]
We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image.
We introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task.
To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline.
arXiv Detail & Related papers (2024-02-08T04:52:36Z) - Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNI reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNI shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z) - Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.