Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
- URL: http://arxiv.org/abs/2506.17545v1
- Date: Sat, 21 Jun 2025 02:29:10 GMT
- Title: Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
- Authors: Zhihao Yuan, Shuyi Jiang, Chun-Mei Feng, Yaolun Zhang, Shuguang Cui, Zhen Li, Na Zhao
- Abstract summary: Scene-R1, a video-grounded framework, learns to reason about 3D scenes without any point-wise 3D instance supervision. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video.
- Score: 37.209795186399326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using large language models to understand the 3D world is becoming increasingly popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. We then track the object with SAM2 to produce pixel-accurate masks across RGB frames and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline needs only task-level 2D boxes or textual labels, without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
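The final lifting step described above (SAM2 masks projected back into 3D) is, in essence, a standard pinhole unprojection over the masked pixels of each RGB-D frame. Below is a minimal sketch of that step, assuming per-frame metric depth, camera intrinsics, and camera-to-world poses are available; function and variable names are illustrative, not Scene-R1's actual API.

```python
import numpy as np

def mask_to_world_points(mask, depth, K, cam_to_world):
    """Lift a binary 2D mask into world-space 3D points.

    mask:         (H, W) bool array, e.g. a SAM2 mask for the tracked object.
    depth:        (H, W) metric depth in meters from the RGB-D sensor.
    K:            (3, 3) camera intrinsics.
    cam_to_world: (4, 4) camera-to-world extrinsic matrix.
    """
    v, u = np.nonzero(mask & (depth > 0))        # masked pixels with valid depth
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project through the pinhole model into camera coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4) homogeneous
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world
```

Aggregating such per-frame point sets across the tracked snippet yields a 3D segment for the queried object without any point-wise 3D instance labels.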
Related papers
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite their architectural similarity to 2D counterparts, 3D scene-centric VLMs have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- 3D Question Answering via only 2D Vision-Language Models [87.41421075243103]
Large vision-language models (LVLMs) have advanced numerous fields. We explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA.
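A rough sketch of the underlying view-sampling idea: project the point cloud into candidate camera views and greedily keep views that see many points while staying far apart. The learned cdViews selector in the paper is more sophisticated; all names and thresholds below are illustrative assumptions.

```python
import numpy as np

def view_coverage(points, K, world_to_cam, hw=(480, 640)):
    """Fraction of point-cloud points visible in a candidate view."""
    h, w = hw
    pts = np.c_[points, np.ones(len(points))] @ world_to_cam.T  # to camera frame
    z = pts[:, 2]
    front = z > 0.1
    z_safe = np.where(front, z, 1.0)         # avoid divide-by-zero behind camera
    u = K[0, 0] * pts[:, 0] / z_safe + K[0, 2]
    v = K[1, 1] * pts[:, 1] / z_safe + K[1, 2]
    return (front & (u >= 0) & (u < w) & (v >= 0) & (v < h)).mean()

def select_views(points, K, cam_poses, k=4, min_dist=1.0):
    """Greedy pick: high coverage first, camera centers at least min_dist apart."""
    scored = sorted(cam_poses,
                    key=lambda T: -view_coverage(points, K, np.linalg.inv(T)))
    chosen = []
    for T in scored:                          # T is a camera-to-world pose
        if all(np.linalg.norm(T[:3, 3] - C[:3, 3]) > min_dist for C in chosen):
            chosen.append(T)
        if len(chosen) == k:
            break
    return chosen
```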
arXiv Detail & Related papers (2025-05-28T09:04:39Z)
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space [10.49905491984899]
We redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding. We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. We transfer the learned language field to 3DGS, achieving real-time rendering for the first time without sacrificing training time or accuracy.
arXiv Detail & Related papers (2024-08-14T09:50:02Z)
- SceneGPT: A Language Model for 3D Scene Understanding [0.9054540533394926]
We present SceneGPT, an LLM-based scene understanding system which can perform 3D spatial reasoning without training or explicit 3D supervision.
Key components of our framework are: 1) a 3D scene graph that serves as the scene representation, encoding the objects in the scene and their spatial relationships, and 2) a pre-trained LLM that can be adapted with in-context learning for 3D spatial reasoning.
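A minimal sketch of this training-free setup, assuming a simple JSON scene-graph schema and prompt wording (both are illustrative, not the paper's exact format):

```python
import json

# Hypothetical scene graph: object labels, 3D centers/sizes (meters), relations.
scene_graph = {
    "objects": [
        {"id": 0, "label": "sofa",  "center": [1.2, 0.4, 0.3], "size": [2.0, 0.9, 0.8]},
        {"id": 1, "label": "table", "center": [1.1, 1.6, 0.4], "size": [1.2, 0.7, 0.7]},
    ],
    "relations": [{"subject": 1, "predicate": "in front of", "object": 0}],
}

prompt = (
    "You are given a 3D scene as a JSON scene graph with object labels, "
    "3D centers and sizes in meters, and spatial relations.\n"
    f"{json.dumps(scene_graph)}\n"
    "Q: Which object is in front of the sofa? Answer with the object label."
)
# `prompt` can be sent to any frozen chat LLM; no 3D training is involved.
```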
arXiv Detail & Related papers (2024-08-13T14:26:30Z)
- General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes.
Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotations.
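The core weak-supervision signal in this family of methods can be sketched as a projection-consistency objective: the predicted 3D box, projected into the image, should tightly overlap the annotated 2D box. The form below (axis-aligned box, IoU in image space) is a simplified assumption, not the paper's exact loss.

```python
import numpy as np

def box3d_corners(center, size):
    """8 corners of an axis-aligned 3D box (rotation omitted for brevity)."""
    offsets = np.array([[dx, dy, dz] for dx in (-.5, .5)
                        for dy in (-.5, .5) for dz in (-.5, .5)])
    return center + offsets * size

def projection_iou(center, size, K, world_to_cam, box2d):
    """IoU between the projected 3D box's image-plane hull and a labeled 2D box.
    Assumes the box lies in front of the camera."""
    pts = np.c_[box3d_corners(center, size), np.ones(8)] @ world_to_cam.T
    u = K[0, 0] * pts[:, 0] / pts[:, 2] + K[0, 2]
    v = K[1, 1] * pts[:, 1] / pts[:, 2] + K[1, 2]
    proj = np.array([u.min(), v.min(), u.max(), v.max()])   # (x1, y1, x2, y2)
    x1, y1 = np.maximum(proj[:2], box2d[:2])
    x2, y2 = np.minimum(proj[2:], box2d[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(proj) + area(box2d) - inter)

# Training would then maximize this overlap, e.g. loss = 1 - projection_iou(...).
```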
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models in aligning the semantics between texts and 2D images. During the inference stage, the learned text-3D correspondence helps ground text queries to the 3D target objects even without 2D images.
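One common way to realize such text-2D-3D alignment is a CLIP-style contrastive objective that pulls matched 3D-object and text embeddings together; a minimal sketch follows. This is an assumed stand-in, not 3D-VLA's actual objective.

```python
import torch
import torch.nn.functional as F

def alignment_loss(feat_3d, feat_text, temperature=0.07):
    """Symmetric InfoNCE between matched 3D-object and text embeddings.

    feat_3d, feat_text: (B, D) tensors; row i of each forms a matched pair.
    """
    z3d = F.normalize(feat_3d, dim=-1)
    ztx = F.normalize(feat_text, dim=-1)
    logits = z3d @ ztx.t() / temperature                  # (B, B) similarities
    target = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```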
arXiv Detail & Related papers (2023-12-15T09:08:14Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
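A toy sketch of the autoregressive decoding loop this implies: the scene latent is unrolled into a sequence of object-class tokens until a stop token. The real model also decodes poses and shapes, and all architecture choices here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScenePriorDecoder(nn.Module):
    """Toy decoder: scene latent -> autoregressive sequence of object classes."""
    def __init__(self, latent_dim=128, n_classes=20, stop_id=0):
        super().__init__()
        self.gru = nn.GRUCell(n_classes, latent_dim)
        self.head = nn.Linear(latent_dim, n_classes)
        self.n_classes, self.stop_id = n_classes, stop_id

    @torch.no_grad()
    def decode(self, z, max_objects=10):
        h = z                                             # (B, latent_dim)
        tok = torch.zeros(z.shape[0], self.n_classes, device=z.device)
        objects = []
        for _ in range(max_objects):
            h = self.gru(tok, h)
            cls = self.head(h).argmax(-1)                 # next object's class
            if (cls == self.stop_id).all():
                break
            objects.append(cls)
            tok = F.one_hot(cls, self.n_classes).float()
        return objects
```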
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
- SAT: 2D Semantics Assisted Training for 3D Visual Grounding [95.84637054325039]
3D visual grounding aims to ground a natural language description of a 3D scene, usually represented as a 3D point cloud, to the targeted object region.
Point clouds are sparse, noisy, and contain limited semantic information compared with 2D images.
We propose 2D Semantics Assisted Training (SAT) that utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning.
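One simple way to inject 2D semantics at training time is to distill frozen 2D-backbone features onto the 3D points they project to; the sketch below shows that idea. SAT's actual formulation differs, and every name here is an assumption.

```python
import torch
import torch.nn.functional as F

def distill_2d_to_3d(point_feats, pixel_feats, uv, valid):
    """L2 distillation from projected 2D features onto 3D point features.

    point_feats: (N, D) features from the 3D point encoder.
    pixel_feats: (H, W, D) features from a frozen 2D backbone.
    uv:          (N, 2) integer pixel coordinates of each point's projection.
    valid:       (N,) bool mask for points projecting inside the image.
    """
    target = pixel_feats[uv[valid, 1], uv[valid, 0]]   # gather matched pixels
    return F.mse_loss(point_feats[valid], target.detach())
```

Because the distillation target is detached, only the 3D encoder is trained, and the 2D branch can be dropped entirely at inference time.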
arXiv Detail & Related papers (2021-05-24T17:58:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.