LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
- URL: http://arxiv.org/abs/2506.09935v2
- Date: Fri, 26 Sep 2025 13:16:53 GMT
- Title: LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
- Authors: Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
- Abstract summary: A key bottleneck is that current scene representations struggle to balance performance and efficiency. We propose the condensed feature grid (CFG), an efficient scene representation featuring significantly reduced token overhead and strong perception capability. We introduce LEO-VL, a 3D VLM trained on 700k 3D-VL data spanning four real-world indoor domains and five tasks such as captioning and dialogue.
- Score: 63.19329995235114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding goal in the 3D-VL community. Despite recent progress, 3D VLMs still fall short of their 2D counterparts in capability and robustness. A key bottleneck is that current scene representations struggle to balance performance and efficiency: competitive performance comes at the cost of heavy token overhead, which in turn hampers the scalability of 3D-VL learning. To address this, we propose the condensed feature grid (CFG), an efficient scene representation featuring significantly reduced token overhead and strong perception capability. Building on CFG, we introduce LEO-VL, a 3D VLM trained on 700k 3D-VL data spanning four real-world indoor domains and five tasks such as captioning and dialogue. To enhance the robustness of 3D VLMs, we further propose SceneDPO for post-training, which involves contrasts across answers and scenes. LEO-VL achieves state-of-the-art performance on various 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Our extensive experiments highlight the efficiency of our representation, the benefit of task and scene diversity, consistent scaling effects, and the advantages of SceneDPO compared to SFT and GRPO. We hope our findings advance the efficiency, scalability, and robustness of future 3D VLMs.
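The abstract states only that CFG condenses scene features into far fewer tokens; the concrete architecture is given in the paper, not here. As a minimal, hypothetical sketch of the general idea (assuming per-point features have already been extracted by some 3D backbone), the PyTorch snippet below pools point features into a coarse occupied-voxel grid, so a scene contributes at most grid_size^3 tokens to the language model regardless of its point count. The function name and default grid size are illustrative, not LEO-VL's actual design.

```python
import torch

def condense_to_grid(points: torch.Tensor,   # (N, 3) xyz coordinates
                     feats: torch.Tensor,    # (N, C) per-point features
                     grid_size: int = 8) -> torch.Tensor:
    """Pool per-point features into a coarse voxel grid via scatter-mean.

    Returns (M, C) condensed tokens, where M <= grid_size**3 is the number
    of occupied voxels -- typically far fewer tokens than N points.
    """
    # Normalize coordinates to [0, 1) inside the scene's bounding box.
    mins, maxs = points.min(0).values, points.max(0).values
    norm = (points - mins) / (maxs - mins + 1e-6)

    # Assign each point to a voxel on a grid_size^3 lattice.
    vox = (norm * grid_size).long().clamp_(max=grid_size - 1)      # (N, 3)
    flat = vox[:, 0] * grid_size**2 + vox[:, 1] * grid_size + vox[:, 2]

    # Scatter-mean the features into their voxels.
    num_vox = grid_size ** 3
    sums = torch.zeros(num_vox, feats.shape[1]).index_add_(0, flat, feats)
    counts = torch.zeros(num_vox).index_add_(0, flat, torch.ones(len(flat)))
    occupied = counts > 0
    return sums[occupied] / counts[occupied].unsqueeze(1)          # (M, C)

# Example: 50k points with 512-dim features collapse to at most 8**3 = 512 tokens.
pts, ft = torch.rand(50_000, 3), torch.rand(50_000, 512)
tokens = condense_to_grid(pts, ft)   # (M, 512), M <= 512
```

SceneDPO is likewise described only as a post-training stage built on contrasts across answers and scenes. A standard DPO objective over such preference pairs would look roughly like the sketch below; how LEO-VL actually constructs the answer-level versus scene-level pairs is not stated in the abstract and is left here as an assumption.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Plain DPO loss for one batch of preferred/dispreferred responses.

    For answer contrasts, a pair could be (good answer, bad answer) for one
    scene; for scene contrasts, the same answer under the matching versus a
    mismatched scene -- both pairings are illustrative assumptions.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```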
Related papers
- 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding [21.70953326671503]
We present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT). 3D-RFT is the first framework to extend RLVR to video-based 3D perception and reasoning. We show that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks.
arXiv Detail & Related papers (2026-03-05T09:15:16Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding [11.069512983766783]
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks. We propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks.
arXiv Detail & Related papers (2025-07-31T11:59:06Z) - Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite their architectural similarity to 2D VLMs, 3D scene-centric VLMs have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation [61.21302433849139]
Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments. We propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train 3D-VLMs in navigation action prediction. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation.
arXiv Detail & Related papers (2025-05-16T15:46:27Z) - Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis [65.42684641776931]
3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models. We propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks.
arXiv Detail & Related papers (2025-03-28T13:32:29Z) - Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding [19.382210260928776]
Video-3D LLM treats 3D scenes as dynamic videos and incorporates 3D position encoding into these representations. Our model achieves state-of-the-art performance on several 3D scene understanding benchmarks.
arXiv Detail & Related papers (2024-11-30T14:28:53Z) - Unifying 3D Vision-Language Understanding via Promptable Queries [39.55438547712157]
PQ3D is a unified model for 3D vision-language (3D-VL) understanding.
It uses Promptable Queries to tackle a wide range of 3D-VL tasks.
Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks.
arXiv Detail & Related papers (2024-05-19T04:35:05Z) - 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z) - An Embodied Generalist Agent in 3D World [67.16935110789528]
We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world.
We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world.
Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation.
arXiv Detail & Related papers (2023-11-18T01:21:38Z) - 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [44.00343134325925]
3D-VisTA is a pre-trained Transformer for 3D Vision and Text Alignment.
ScanScribe is the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training.
arXiv Detail & Related papers (2023-08-08T15:59:17Z) - Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes [68.61199623705096]
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore.
We propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations.
arXiv Detail & Related papers (2023-06-04T11:08:53Z)