ChangingGrounding: 3D Visual Grounding in Changing Scenes
- URL: http://arxiv.org/abs/2510.14965v1
- Date: Thu, 16 Oct 2025 17:59:16 GMT
- Title: ChangingGrounding: 3D Visual Grounding in Changing Scenes
- Authors: Miao Hu, Zhiwei Huang, Tai Wang, Jiangmiao Pang, Dahua Lin, Nanning Zheng, Runsen Xu
- Abstract summary: Real-world robots localize objects from natural-language instructions while the scenes around them keep changing. Most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud. We introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations.
- Score: 92.00984845186679
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-world robots localize objects from natural-language instructions while the scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to obtain accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/ .
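The abstract describes Mem-ChangingGrounder as a sequence of steps: classify the query, retrieve memories, explore guided by them, fall back when memories are stale, then scan and fuse. A minimal control-flow sketch of that loop is below; all names (`Memory`, `ground`, the callback parameters) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the Mem-ChangingGrounder loop; every identifier
# here is illustrative and not taken from the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Past observations: maps an object type to previously seen viewpoints."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, object_type):
        return self.entries.get(object_type, [])


def ground(query, memory, explore, scan_views, fuse_to_box, classify):
    object_type = classify(query)                 # 1. infer type implied by the query
    for viewpoint in memory.retrieve(object_type):  # 2. memories guide exploration
        views = scan_views(viewpoint)             # 3. multi-view scan of the target
        if views:
            return fuse_to_box(views)             # 4. project fused evidence to a box
    # 5. fall back to open exploration when remembered viewpoints are invalid
    views = scan_views(explore(object_type))
    return fuse_to_box(views) if views else None
```

The callbacks stand in for the retrieval, exploration, and fusion components; in a real system each would wrap a perception module rather than a lambda.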
Related papers
- Error-Driven Scene Editing for 3D Grounding in Large Language Models [71.41120775319088]
Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources. We propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases.
arXiv Detail & Related papers (2025-11-18T03:13:29Z)
- T-3DGS: Removing Transient Objects for 3D Scene Reconstruction [83.05271859398779]
Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. We propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting.
arXiv Detail & Related papers (2024-11-29T07:45:24Z)
- 3DGS-CD: 3D Gaussian Splatting-based Change Detection for Physical Object Rearrangement [2.2122801766964795]
We present 3DGS-CD, the first 3D Gaussian Splatting (3DGS)-based method for detecting physical object rearrangements in 3D scenes. Our approach estimates 3D object-level changes by comparing two sets of unaligned images taken at different times. Our method can accurately identify changes in cluttered environments using sparse (as few as one) post-change images within as little as 18s.
arXiv Detail & Related papers (2024-11-06T07:08:41Z)
- Improved Scene Landmark Detection for Camera Localization [11.56648898250606]
A method based on scene landmark detection (SLD) was recently proposed to address these limitations.
It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks.
We show that the accuracy gap was due to insufficient model capacity and noisy labels during training.
arXiv Detail & Related papers (2024-01-31T18:59:12Z)
- What You See Is What You Detect: Towards better Object Densification in 3D detection [2.3436632098950456]
The widely used full-shape completion approach actually leads to a higher error upper bound, especially for faraway objects and small objects like pedestrians.
We introduce a visible part completion method that requires only 11.3% of the prediction points that previous methods generate.
To recover the dense representation, we propose a mesh-deformation-based method to augment the point set associated with visible foreground objects.
arXiv Detail & Related papers (2023-10-27T01:46:37Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- Dynamic 3D Scene Analysis by Point Cloud Accumulation [32.491921765128936]
Multi-beam LiDAR sensors are used on autonomous vehicles and mobile robots.
Each frame covers the scene sparsely, due to limited angular scanning resolution and occlusion.
We propose a method that exploits inductive biases of outdoor street scenes, including their geometric layout and object-level rigidity.
arXiv Detail & Related papers (2022-07-25T17:57:46Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- Delving into Localization Errors for Monocular 3D Object Detection [85.77319416168362]
Estimating 3D bounding boxes from monocular images is an essential component in autonomous driving.
In this work, we quantify the impact introduced by each sub-task and find that localization error is the vital factor restricting monocular 3D detection.
arXiv Detail & Related papers (2021-03-30T10:38:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.