Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
- URL: http://arxiv.org/abs/2511.21191v1
- Date: Wed, 26 Nov 2025 09:12:17 GMT
- Title: Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
- Authors: Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen
- Abstract summary: We present NDTokenizer3D, a 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation. NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details.
- Score: 39.18256367776712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
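The first stage described in the abstract, building a multi-scale NDT representation from a raw point cloud, can be illustrated with the classical Normal Distributions Transform: partition space into voxels and summarize the points in each voxel by a Gaussian (mean and covariance). The sketch below is a minimal NumPy illustration of that general idea, not the authors' implementation. The function names `ndt_cells` and `multi_scale_ndt`, the voxel sizes, and the `min_points` threshold are all assumptions made for the example.

```python
import numpy as np


def ndt_cells(points, voxel_size, min_points=3):
    """Summarize each occupied voxel by the mean and covariance of its points.

    Hypothetical helper: classical NDT statistics at a single voxel size.
    Returns a dict mapping integer voxel index (i, j, k) -> (mean, covariance).
    """
    # Integer voxel coordinates for every point.
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; the shape varies across NumPy versions

    cells = {}
    for i, key in enumerate(uniq):
        pts = points[inverse == i]
        if len(pts) < min_points:
            continue  # too few points for a meaningful covariance
        mean = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)  # 3x3 sample covariance
        cells[tuple(key.tolist())] = (mean, cov)
    return cells


def multi_scale_ndt(points, voxel_sizes=(0.4, 0.2, 0.1)):
    """Build NDT cells at several voxel sizes (coarse to fine); sizes are assumed."""
    return {size: ndt_cells(points, size) for size in voxel_sizes}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(0.0, 1.0, size=(2000, 3))  # synthetic stand-in for a scan
    scales = multi_scale_ndt(cloud, voxel_sizes=(0.5, 0.25))
    for size, cells in scales.items():
        print(f"voxel size {size}: {len(cells)} occupied cells")
```

Coarse voxels capture global layout with few cells, while fine voxels retain local geometry. How the paper encodes these per-cell statistics into features and fuses them across scales in MSDec is not specified in this listing and is not modeled here.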
Related papers
- Unified Semantic Transformer for 3D Scene Understanding [55.415468022487005]
We introduce UNITE, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models.
arXiv Detail & Related papers (2025-12-16T12:49:35Z)
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z)
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning [18.185457833299235]
We propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. We first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects.
arXiv Detail & Related papers (2025-03-01T14:38:42Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding [12.823274886850697]
We introduce a novel and efficient prompt tuning paradigm, 3DMIT.
This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with the 3D modality information.
We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain.
arXiv Detail & Related papers (2024-01-06T12:20:18Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts.
Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.