Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
- URL: http://arxiv.org/abs/2508.11058v1
- Date: Thu, 14 Aug 2025 20:35:59 GMT
- Title: Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
- Authors: Wentao Mo, Qingchao Chen, Yuxin Peng, Siyuan Huang, Yang Liu
- Abstract summary: MV-ScanQA is a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views. We present TripAlign, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to 3D domain with TripAlign.
- Score: 56.533371387182065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.
Related papers
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model [14.277165215664425]
Large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. We propose a novel hierarchical multimodal representation for 3D scene reasoning.
arXiv Detail & Related papers (2025-11-28T08:06:20Z)
- 3D Question Answering via only 2D Vision-Language Models [87.41421075243103]
Large vision-language models (LVLMs) have advanced numerous fields. We explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA.
arXiv Detail & Related papers (2025-05-28T09:04:39Z)
- Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving [45.82124136705798]
DriveMonkey is a framework that seamlessly integrates Large Visual-Language Models with a spatial processor. Our experiments show that DriveMonkey outperforms general LVLMs, notably achieving a 9.86% improvement on the 3D visual grounding task.
arXiv Detail & Related papers (2025-05-13T16:36:51Z)
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [91.94869042117621]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation. We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z)
- MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs [13.678235444299286]
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes.
arXiv Detail & Related papers (2025-03-17T12:34:22Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.