ROOT: VLM based System for Indoor Scene Understanding and Beyond
- URL: http://arxiv.org/abs/2411.15714v1
- Date: Sun, 24 Nov 2024 04:51:24 GMT
- Title: ROOT: VLM based System for Indoor Scene Understanding and Beyond
- Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li
- Abstract summary: ROOT is a VLM-based system designed to enhance the analysis of indoor scenes.
ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI.
- Score: 83.71252153660078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we apply various training recipes and obtain SceneVLM. Our experiments demonstrate that ROOT facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at https://github.com/harrytea/ROOT.
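The abstract describes SceneVLM's output as a spatial hierarchical scene graph together with inter-object distances. As a rough illustration only (the class names, fields, and values below are hypothetical and not taken from the ROOT codebase), such an output could be represented along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """A detected object plus the meta-information mentioned in the abstract."""
    name: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) from a detector

@dataclass
class SceneNode:
    """One level of the spatial hierarchy, e.g. scene -> room -> functional region."""
    label: str
    objects: list[SceneObject] = field(default_factory=list)
    children: list["SceneNode"] = field(default_factory=list)

# Pairwise object distances in meters, of the kind SceneVLM is said to estimate.
# The values here are invented for illustration.
distances: dict[tuple[str, str], float] = {
    ("sofa", "coffee_table"): 0.8,
    ("coffee_table", "tv_stand"): 2.1,
}

living_room = SceneNode(
    label="living_room",
    objects=[
        SceneObject("sofa", (120.0, 300.0, 560.0, 620.0)),
        SceneObject("coffee_table", (380.0, 520.0, 640.0, 700.0)),
        SceneObject("tv_stand", (700.0, 260.0, 980.0, 540.0)),
    ],
)
scene = SceneNode(label="indoor_scene", children=[living_room])
```

A downstream 3D scene generation or embodied-AI module could traverse `scene` and consult `distances` when placing objects or planning motion between them, which matches the applications the abstract mentions.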
Related papers
- Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph [0.0]
OVIGo-3DHSG represents an extensive indoor environment with a Hierarchical Scene Graph.
The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects.
Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods.
arXiv Detail & Related papers (2025-07-16T10:47:12Z) - SpatialLM: Training Large Language Models for Structured Indoor Modeling [34.0957676434764]
SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs.
We collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes with ground-truth 3D annotations.
Our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection.
arXiv Detail & Related papers (2025-06-09T07:10:58Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.
We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.
Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - DSM: Building A Diverse Semantic Map for 3D Visual Grounding [4.89669292144966]
We propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks.
This method leverages vision-language models (VLMs) to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy.
Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art.
arXiv Detail & Related papers (2025-04-11T07:18:42Z) - Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes.
We also present USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Monocular Occupancy Prediction for Scalable Indoor Scenes [56.686307396496545]
We propose a novel method, named ISO, for predicting indoor scene occupancy using monocular images.
ISO harnesses the advantages of a pretrained depth model to achieve accurate depth predictions.
We also introduce Occ-ScanNet, a large-scale occupancy benchmark for indoor scenes.
arXiv Detail & Related papers (2024-07-16T13:50:40Z) - Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases [13.126239167800652]
We present a system for generating indoor scenes in response to text prompts.
The prompts are not limited to a fixed vocabulary of scene descriptions.
The objects in generated scenes are not restricted to a fixed set of object categories.
arXiv Detail & Related papers (2024-01-22T18:01:01Z... see below) - SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z) - Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies [16.396336068724484]
This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments.
The hierarchy of concepts that describe an outdoor environment is more complex than that of indoor environments.
The lack of training data prevents the straightforward application of learning-based tools used in indoor settings.
arXiv Detail & Related papers (2023-12-18T21:20:28Z) - Neural Implicit Dense Semantic SLAM [83.04331351572277]
We propose a novel RGBD vSLAM algorithm that learns a memory-efficient, dense 3D geometry, and semantic segmentation of an indoor scene in an online manner.
Our pipeline combines classical 3D vision-based tracking and loop closing with neural fields-based mapping.
Our proposed algorithm can greatly enhance scene perception and assist with a range of robot control problems.
arXiv Detail & Related papers (2023-04-27T23:03:52Z) - Indoor Scene Generation from a Collection of Semantic-Segmented Depth Images [18.24156991697044]
We present a method for creating 3D indoor scenes with a generative model learned from semantic-segmented depth images.
Given a room of a specified size, our method automatically generates 3D objects in the room from a randomly sampled latent code.
Compared to existing methods, our method not only efficiently reduces the workload of modeling and acquiring 3D scenes for training, but also produces better object shapes.
arXiv Detail & Related papers (2021-08-20T06:22:49Z) - Walk2Map: Extracting Floor Plans from Indoor Walk Trajectories [23.314557741879664]
We present Walk2Map, a data-driven approach to generate floor plans from trajectories of a person walking inside the rooms.
Thanks to advances in data-driven inertial odometry, such minimalistic input data can be acquired from the IMU readings of consumer-level smartphones.
We train our networks using scanned 3D indoor models and apply them in a cascaded fashion on an indoor walk trajectory.
arXiv Detail & Related papers (2021-02-27T16:29:09Z) - Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
arXiv Detail & Related papers (2021-02-16T11:40:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.