GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
- URL: http://arxiv.org/abs/2503.10596v2
- Date: Mon, 21 Apr 2025 14:25:51 GMT
- Title: GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
- Authors: Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
- Abstract summary: GroundingSuite aims to bridge the gap between vision and language modalities. It comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images.
- Score: 39.967352995143855
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is markedly more efficient than the current leading data annotation method, running $4.5\times$ faster than GLaMM.
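As a rough guide to the reported metrics: cIoU (cumulative IoU) is commonly computed by pooling intersection and union pixel counts over the whole evaluation set, while gIoU (generalized IoU) averages per-sample IoU. The sketch below illustrates both for binary masks; the treatment of no-target (empty) samples and other benchmark-specific details are assumptions and may differ from the official gRefCOCO / RefCOCOm evaluation protocols.

```python
import numpy as np

def ciou_giou(pred_masks, gt_masks):
    """Illustrative cIoU / gIoU over a set of binary segmentation masks.

    cIoU: total intersection pixels / total union pixels across all samples.
    gIoU: mean of per-sample IoU values.
    (Sketch only; empty-vs-empty handling is an assumption.)
    """
    inter_total, union_total, per_sample = 0, 0, []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inter_total += inter
        union_total += union
        # Count an empty prediction on an empty target as a perfect match.
        per_sample.append(inter / union if union > 0 else 1.0)
    ciou = inter_total / max(union_total, 1)
    giou = float(np.mean(per_sample))
    return ciou, giou
```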
Related papers
- Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels [30.722073025794025]
We address five critical real-world challenges in text-instruction-based grounding. Our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks.
arXiv Detail & Related papers (2025-05-20T00:37:19Z) - A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models [48.361839372110246]
We develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting. We evaluate 19 large language models and uncover substantial variation in performance across constraint forms. In-depth analysis indicates that these gains stem primarily from modifications to the model's attention module parameters.
arXiv Detail & Related papers (2025-05-12T14:16:55Z) - NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving [7.007334645975593]
We introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving.
We propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs with precise localization abilities of specialist detection models.
arXiv Detail & Related papers (2025-03-28T13:55:16Z) - Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments [4.106846770364469]
Off-road environments pose significant perception challenges for high-speed autonomous navigation.
We propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (500 images), sparsely and coarsely labeled (30% of pixels) multi-biome dataset.
The predicted semantic classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map.
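The voxel-map step can be pictured with a minimal label-accumulation sketch: each labeled point is binned into a voxel and the voxel takes the majority class. This is a generic illustration only; the paper's range-based fusion metric and any confidence weighting are not reproduced, and all names below are hypothetical.

```python
import numpy as np
from collections import defaultdict

def build_semantic_voxel_map(points, labels, voxel_size=0.2):
    """Accumulate per-point semantic labels into a 3D voxel map by majority vote.

    points: (N, 3) xyz coordinates, labels: (N,) integer class ids.
    Returns {voxel_index: winning_class}. Sketch only.
    """
    counts = defaultdict(lambda: defaultdict(int))
    voxel_idx = np.floor(points / voxel_size).astype(int)
    for v, c in zip(map(tuple, voxel_idx), labels):
        counts[v][int(c)] += 1
    # Pick the most frequently observed class per voxel.
    return {v: max(cls.items(), key=lambda kv: kv[1])[0] for v, cls in counts.items()}
```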
arXiv Detail & Related papers (2024-11-10T23:52:24Z) - PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis [42.187844778761935]
This study introduces Point Vision GNN (PointViG), an efficient framework for point cloud analysis.
PointViG incorporates a lightweight graph convolutional module to efficiently aggregate local features.
Experiments demonstrate that PointViG achieves performance comparable to state-of-the-art models.
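For a concrete feel of graph-based local feature aggregation on point clouds, here is a minimal k-NN neighbourhood aggregation in the spirit of EdgeConv-style graph convolutions. It is an illustrative sketch only, not the PointViG module; the brute-force distance matrix and max-pooled relative features are simplifying assumptions.

```python
import numpy as np

def knn_graph_aggregate(points, feats, k=16):
    """Aggregate local features over a k-NN graph of a point cloud (sketch only).

    points: (N, 3) xyz coordinates, feats: (N, C) per-point features.
    Each point gathers its k nearest neighbours and max-pools the
    (neighbour - centre) feature differences.
    """
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    idx = np.argsort(dists, axis=1)[:, 1:k + 1]                               # k nearest, excluding self
    neighbour_feats = feats[idx]                                              # (N, k, C)
    edge_feats = neighbour_feats - feats[:, None, :]                          # relative features
    return np.concatenate([feats, edge_feats.max(axis=1)], axis=-1)           # (N, 2C)
```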
arXiv Detail & Related papers (2024-07-01T02:55:45Z) - 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset [13.808860456901204]
We introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench.
Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level.
We present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total.
arXiv Detail & Related papers (2024-04-23T02:06:10Z) - ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation [0.5277756703318045]
ECLAIR is a new outdoor large-scale aerial LiDAR dataset designed specifically for advancing research in point cloud semantic segmentation.
The dataset covers a total area of 10 km$^2$ with close to 600 million points and features eleven distinct object categories.
The dataset is engineered to advance the fields of 3D urban modeling, scene understanding, and utility infrastructure management.
arXiv Detail & Related papers (2024-04-16T16:16:40Z) - Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation
for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D Segmentation (Navya3DSeg), with a diverse label space corresponding to a large-scale, production-grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z) - Improving Visual Grounding by Encouraging Consistent Gradient-based
Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces better visual grounding results than previous methods.
AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - G-RCN: Optimizing the Gap between Classification and Localization Tasks
for Object Detection [3.620272428985414]
We show that sharing high-level features for the classification and localization tasks is sub-optimal.
We propose a paradigm called Gap-optimized Region-based Convolutional Network (G-RCN).
The new method is applied to Faster R-CNN with VGG16, ResNet50, and ResNet101 backbones.
arXiv Detail & Related papers (2020-11-14T04:14:01Z) - Grounded Situation Recognition [56.18102368133022]
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images.
GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities.
We show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval.
arXiv Detail & Related papers (2020-03-26T17:57:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.