4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
- URL: http://arxiv.org/abs/2503.10437v2
- Date: Tue, 01 Apr 2025 03:10:36 GMT
- Title: 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
- Authors: Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister
- Abstract summary: We propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions. Our results demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
- Score: 58.80200897869225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
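The abstract describes two technical ideas: (1) object-wise video captions produced by MLLMs are encoded by a language model into sentence embeddings that serve as pixel-aligned, object-specific supervision, and (2) a status deformable network blends a small set of per-object states with smoothly varying weights over time. The following is a minimal illustrative sketch of those two ideas only, not the authors' implementation: the sentence-transformers encoder, the example captions, and the StatusBlend module are assumptions chosen for brevity, whereas the paper supervises rendered per-Gaussian features with embeddings from a Large Language Model.

```python
# Hypothetical sketch (not the 4D LangSplat code): encode object-wise video
# captions into sentence embeddings, then fit time-varying blends of a few
# learned state embeddings to them, loosely mirroring the "status deformable
# network" idea from the abstract.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # assumed stand-in encoder

# 1) Encode MLLM-generated captions for one object across frames.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
captions = [
    "a person lifting a cup toward their mouth",   # frame t0 (illustrative)
    "a person drinking from the cup",              # frame t1
    "a person lowering the cup onto the table",    # frame t2
]
caption_feats = torch.tensor(encoder.encode(captions))  # (T, D) supervision targets


class StatusBlend(nn.Module):
    """Blend K learnable per-object state embeddings with time-dependent weights."""

    def __init__(self, num_states: int = 4, dim: int = 384, hidden: int = 64):
        super().__init__()
        self.states = nn.Parameter(torch.randn(num_states, dim))  # prototype state features
        self.weight_net = nn.Sequential(                          # time -> state weights
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_states)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weight_net(t.unsqueeze(-1)), dim=-1)  # (T, K), smooth in t
        return w @ self.states                                       # (T, D) blended feature


model = StatusBlend(dim=caption_feats.shape[1])
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
times = torch.linspace(0.0, 1.0, steps=caption_feats.shape[0])

for _ in range(200):  # fit blended features to the caption embeddings
    pred = model(times)
    loss = (1.0 - torch.cosine_similarity(pred, caption_feats, dim=-1)).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In the full method the time-varying feature would be attached to the Gaussians and rendered into image space, with the loss taken against the caption embedding over the pixels that the object covers; the sketch above only shows the caption-to-embedding supervision and the smooth state blending.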
Related papers
- MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation [87.30919771444117]
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning.
Recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation.
We introduce MLLM-For3D, a framework that transfers knowledge from 2D MLLMs to 3D scene understanding.
arXiv Detail & Related papers (2025-03-23T16:40:20Z) - 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding [83.37551035659119]
There are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects.
We introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding.
arXiv Detail & Related papers (2025-03-22T17:55:53Z) - 4-LEGS: 4D Language Embedded Gaussian Splatting [12.699978393733309]
We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
arXiv Detail & Related papers (2024-10-14T17:00:53Z) - PiTe: Pixel-Temporal Alignment for Large Video-Language Model [40.76776645042339]
Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, into the latent space of language features.
We propose PiTe, a novel LVidLM built on trajectory-guided Pixel-Temporal Alignment, which exhibits promising, broadly applicable model properties.
arXiv Detail & Related papers (2024-09-11T12:53:07Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - LangSplat: 3D Language Gaussian Splatting [42.16849512832556]
LangSplat constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces.
LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin.
arXiv Detail & Related papers (2023-12-26T15:14:37Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z) - LERF: Language Embedded Radiance Fields [35.925752853115476]
Language Embedded Radiance Fields (LERF) is a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF.
LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays.
After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time.
arXiv Detail & Related papers (2023-03-16T17:59:20Z) - PLA: Language-Driven Open-Vocabulary 3D Scene Understanding [57.47315482494805]
Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough in 2D open-vocabulary perception is driven by Internet-scale paired image-text data with rich vocabulary concepts.
We propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D.
arXiv Detail & Related papers (2022-11-29T15:52:22Z)