Related papers: GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

URL: http://arxiv.org/abs/2506.21188v2
Date: Tue, 22 Jul 2025 07:31:10 GMT
Title: GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Authors: Zijun Lin, Shuting He, Cheston Tan, Bihan Wen,
Abstract summary: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps.<n>Current 3D visual grounding methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step.<n>We propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding.
Score: 26.430390282267062
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5\% and +10.2\%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

Related papers

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D [68.23391872643268]
LOCATE 3D is a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp"<n>It operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices.
arXiv Detail & Related papers (2025-04-19T02:51:24Z)
Task-oriented Sequential Grounding and Navigation in 3D Scenes [33.740081195089964]
Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment.<n>In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes.<n>We present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes.
arXiv Detail & Related papers (2024-08-07T18:30:18Z)
TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D. This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.<n>The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions [27.469346807311574]
Text-guided 3D visual grounding (T-3DVG) aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene. Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing.
arXiv Detail & Related papers (2024-06-09T13:52:12Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations. We leverage the strong semantic prior within a 3D generative model to train a semantic decoder. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases [35.18565109770112]
3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then conducting the reasoning according to contextual phrases. By tapping on our datasets, we can extend previous 3DVG methods to the fine-grained phrase-aware scenario. Results confirm significant improvements, i.e., previous state-of-the-art method achieves 3.9%, 3.5% and 4.6% overall accuracy gains on Nr3D, Sr3D and ScanRefer respectively.
arXiv Detail & Related papers (2022-07-05T05:50:12Z)
Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds. We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
H3D: Benchmark on Semantic Segmentation of High-Resolution 3D Point Clouds and textured Meshes from UAV LiDAR and Multi-View-Stereo [4.263987603222371]
This paper introduces a 3D dataset which is unique in three ways. It depicts the village of Hessigheim (Germany) henceforth referred to as H3D. It is designed for promoting research in the field of 3D data analysis on one hand and to evaluate and rank emerging approaches.
arXiv Detail & Related papers (2021-02-10T09:33:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.