ROOT: VLM based System for Indoor Scene Understanding and Beyond
- URL: http://arxiv.org/abs/2411.15714v1
- Date: Sun, 24 Nov 2024 04:51:24 GMT
- Title: ROOT: VLM based System for Indoor Scene Understanding and Beyond
- Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li
- Abstract summary: ROOT is a VLM-based system designed to enhance the analysis of indoor scenes.
ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI.
- Score: 83.71252153660078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we apply various training recipes and obtain SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at https://github.com/harrytea/ROOT.
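The abstract describes SceneVLM's output as a spatial hierarchical scene graph together with inter-object distances. As a rough illustration only (the class names, fields, and values below are hypothetical and not taken from the ROOT codebase), such an output could be represented along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """A detected object plus the meta-information mentioned in the abstract."""
    name: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) from a detector

@dataclass
class SceneNode:
    """One level of the spatial hierarchy, e.g. scene -> room -> functional region."""
    label: str
    objects: list[SceneObject] = field(default_factory=list)
    children: list["SceneNode"] = field(default_factory=list)

# Pairwise object distances in meters, of the kind SceneVLM is said to estimate.
# The values here are invented for illustration.
distances: dict[tuple[str, str], float] = {
    ("sofa", "coffee_table"): 0.8,
    ("coffee_table", "tv_stand"): 2.1,
}

living_room = SceneNode(
    label="living_room",
    objects=[
        SceneObject("sofa", (120.0, 300.0, 560.0, 620.0)),
        SceneObject("coffee_table", (380.0, 520.0, 640.0, 700.0)),
        SceneObject("tv_stand", (700.0, 260.0, 980.0, 540.0)),
    ],
)
scene = SceneNode(label="indoor_scene", children=[living_room])
```

A downstream 3D scene generation or embodied-AI module could traverse `scene` and consult `distances` when placing objects or planning motion between them, which matches the applications the abstract mentions.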
Related papers
- Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph [0.0]
OVIGo-3DHSG represents an extensive indoor environment with a Hierarchical Scene Graph.
The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects.
Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods.
arXiv Detail & Related papers (2025-07-16T10:47:12Z) - SpatialLM: Training Large Language Models for Structured Indoor Modeling [34.0957676434764]
SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs.
We collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes with ground-truth 3D annotations.
Our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection.
arXiv Detail & Related papers (2025-06-09T07:10:58Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.
We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.
Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - DSM: Building A Diverse Semantic Map for 3D Visual Grounding [4.89669292144966]
We propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks.
This method leverages vision-language models (VLMs) to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy.
Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art.
arXiv Detail & Related papers (2025-04-11T07:18:42Z) - Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes.
We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Monocular Occupancy Prediction for Scalable Indoor Scenes [56.686307396496545]
We propose a novel method, named ISO, for predicting indoor scene occupancy using monocular images.
ISO harnesses the advantages of a pretrained depth model to achieve accurate depth predictions.
We also introduce Occ-ScanNet, a large-scale occupancy benchmark for indoor scenes.
arXiv Detail & Related papers (2024-07-16T13:50:40Z) - Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases [13.126239167800652]
We present a system for generating indoor scenes in response to text prompts.
The prompts are not limited to a fixed vocabulary of scene descriptions.
The objects in generated scenes are not restricted to a fixed set of object categories.
arXiv Detail & Related papers (2024-01-22T18:01:01Z... see below) - SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z) - Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies [16.396336068724484]
This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments.
The hierarchy of concepts that describe an outdoor environment is more complex than that of indoor environments.
The lack of training data prevents the straightforward application of learning-based tools used in indoor settings.
arXiv Detail & Related papers (2023-12-18T21:20:28Z) - Neural Implicit Dense Semantic SLAM [83.04331351572277]
We propose a novel RGBD vSLAM algorithm that learns a memory-efficient, dense 3D geometry, and semantic segmentation of an indoor scene in an online manner.
Our pipeline combines classical 3D vision-based tracking and loop closing with neural fields-based mapping.
Our proposed algorithm can greatly enhance scene perception and assist with a range of robot control problems.
arXiv Detail & Related papers (2023-04-27T23:03:52Z) - Indoor Scene Generation from a Collection of Semantic-Segmented Depth Images [18.24156991697044]
We present a method for creating 3D indoor scenes with a generative model learned from semantic-segmented depth images.
Given a room of a specified size, our method automatically generates 3D objects in the room from a randomly sampled latent code.
Compared to existing methods, our method not only efficiently reduces the workload of modeling and acquiring 3D scenes for training, but also produces better object shapes.
arXiv Detail & Related papers (2021-08-20T06:22:49Z) - Walk2Map: Extracting Floor Plans from Indoor Walk Trajectories [23.314557741879664]
We present Walk2Map, a data-driven approach to generate floor plans from trajectories of a person walking inside the rooms.
Thanks to advances in data-driven inertial odometry, such minimalistic input data can be acquired from the IMU readings of consumer-level smartphones.
We train our networks using scanned 3D indoor models and apply them in a cascaded fashion on an indoor walk trajectory.
arXiv Detail & Related papers (2021-02-27T16:29:09Z) - Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
arXiv Detail & Related papers (2021-02-16T11:40:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.