Related papers: LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

URL: http://arxiv.org/abs/2412.01292v2
Date: Sun, 02 Feb 2025 11:49:25 GMT
Title: LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Authors: Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan,
Abstract summary: LSceneLLM is an adaptive framework that automatically identifies task-relevant areas.<n>A dense token selector examines the attention map of LLM to identify visual preferences for the instruction input.<n>An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information.
Score: 70.0873383646651
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, which is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information and missing details for the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focusing area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large scene understanding and existing scene understanding benchmarks. Plunging our scene magnifier module into the existing 3D-VLMs also brings significant improvement.

Related papers

NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation [14.046423852723615]
We introduce a novel 3D Gaussian Splatting based hard visual prompting approach to generate diverse viewpoints around target objects. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts. This training-free strategy integrates seamlessly with prior hard visual prompts, enriching object-descriptive features.
arXiv Detail & Related papers (2025-04-20T14:39:27Z)
3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D Affordance detection is a challenging problem with broad applications on various robotic tasks. We reformulate the traditional affordance detection paradigm into textit Reasoning Affordance (IRAS) task. We propose 3D-ADLLM, a framework designed for reasoning affordance detection in 3D open-scene.
arXiv Detail & Related papers (2025-02-27T12:29:44Z)
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG) RAG introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes. Per-task instruction-following templates are employed to ensure natural and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning [24.162598399141785]
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning.
arXiv Detail & Related papers (2024-03-18T01:18:48Z)
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level. Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework. Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene. In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
HyperDet3D: Learning a Scene-conditioned 3D Object Detector [154.84798451437032]
We propose HyperDet3D to explore scene-conditioned prior knowledge for 3D object detection. Our HyperDet3D achieves state-of-the-art results on the 3D object detection benchmark of the ScanNet and SUN RGB-D datasets.
arXiv Detail & Related papers (2022-04-12T07:57:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.