OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding
- URL: http://arxiv.org/abs/2601.16538v1
- Date: Fri, 23 Jan 2026 08:17:57 GMT
- Title: OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding
- Authors: Zixian Liu, Zhaoxi Chen, Liang Pan, Ziwei Liu
- Abstract summary: OnlineSI is a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations. We further integrate 3D point cloud information with semantic information, helping the MLLM to better locate and identify objects in the scene.
- Score: 53.33067495235966
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, researchers have become increasingly interested in how to equip Multimodal Large Language Models (MLLMs) with spatial understanding and reasoning capabilities. However, most existing methods overlook the ability to operate continuously in an ever-changing world and are difficult to deploy on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that continuously improves its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory that retains past observations, ensuring that the computation required for each inference does not grow as the input accumulates. We further integrate 3D point cloud information with semantic information, helping the MLLM better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_1$-Score to mitigate ambiguity, and test our method on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.
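A minimal sketch of the two mechanisms the abstract describes: a capacity-bounded spatial memory that keeps per-inference cost constant, and a partial-credit F1 metric. All class and function names, the eviction policy, and the threshold-based fuzzy matching below are assumptions for illustration, not the authors' released implementation or the paper's exact metric definition.

```python
# Illustrative sketch only: names, the FIFO eviction policy, and the
# threshold-based fuzzy matching are assumptions, not the paper's code.
from collections import deque
from typing import Callable, Deque, List, Sequence
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """One processed frame: back-projected 3D points plus aligned semantic features."""
    points_xyz: np.ndarray   # (N, 3) point cloud for the frame
    semantics: np.ndarray    # (N, D) per-point semantic features


class SpatialMemory:
    """Finite memory of past observations.

    Capping the number of stored entries keeps the context handed to the MLLM
    constant in size, so per-inference compute does not grow with stream length.
    """

    def __init__(self, capacity: int = 16):
        # deque with maxlen evicts the oldest entry automatically once full
        self.entries: Deque[Observation] = deque(maxlen=capacity)

    def insert(self, obs: Observation) -> None:
        self.entries.append(obs)

    def as_context(self) -> List[Observation]:
        return list(self.entries)  # bounded context for the next MLLM query


def fuzzy_f1(preds: Sequence, gts: Sequence,
             similarity: Callable[[object, object], float],
             tau: float = 0.5) -> float:
    """Toy partial-credit F1: an item counts as a soft hit if its best similarity
    to the other set exceeds tau. The paper's Fuzzy F1-Score may be defined differently."""
    if not preds or not gts:
        return 0.0
    hits_p = sum(max(similarity(p, g) for g in gts) >= tau for p in preds)
    hits_g = sum(max(similarity(g, p) for p in preds) >= tau for g in gts)
    precision, recall = hits_p / len(preds), hits_g / len(gts)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```

Under this sketch, memory size and context length stay constant no matter how long the video stream runs, which is what makes online, embodied deployment plausible.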
Related papers
- Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling [68.14113731953971]
This paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like imagination. We show that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks.
arXiv Detail & Related papers (2025-12-01T16:01:41Z)
- SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding [64.86119288520419]
Multimodal language models struggle with spatial reasoning across time and space. We present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
arXiv Detail & Related papers (2025-11-06T18:53:31Z)
- Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
- How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM [39.65493154187172]
Large Language Models (LLMs) have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams.
arXiv Detail & Related papers (2025-04-08T08:11:39Z)
- Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment.
We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation.
We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [130.40123493752816]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Joint Supervised and Self-Supervised Learning for 3D Real-World Challenges [16.328866317851187]
Point cloud processing and 3D shape understanding are challenging tasks for which deep learning techniques have demonstrated great potential.
Here we consider several possible scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps.
We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that can solve a 3D puzzle while learning the main task of shape classification or part segmentation.
arXiv Detail & Related papers (2020-04-15T23:34:03Z)