LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
- URL: http://arxiv.org/abs/2505.12253v1
- Date: Sun, 18 May 2025 06:18:57 GMT
- Title: LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
- Authors: Hanyu Zhou, Gim Hee Lee
- Abstract summary: We propose a general LMM framework with a spatiotemporal prompt for visual representation in 4D scene understanding. The prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
- Score: 55.81291976637705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed spatial prompts within visual features to represent the scene. However, these methods are limited to understanding the static background and fail to capture temporally varying dynamic objects. In this paper, we propose LLaVA-4D, a general LMM framework with a novel spatiotemporal prompt for visual representation in 4D scene understanding. The spatiotemporal prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Moreover, we demonstrate that spatial and temporal components disentangled from visual features are more effective in distinguishing the background from objects. This motivates embedding the 4D spatiotemporal prompt into these features to enhance the dynamic scene representation. By aligning visual spatiotemporal embeddings with language embeddings, LMMs gain the ability to understand both spatial and temporal characteristics of static background and dynamic objects in the physical world. Additionally, we construct a 4D vision-language dataset with spatiotemporal coordinate annotations for instruction fine-tuning LMMs. Extensive experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
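As a rough illustration of the mechanism described in the abstract, the sketch below encodes 3D position and 1D time into a single 4D coordinate embedding and adds it to per-token visual features as a spatiotemporal prompt. This is a minimal sketch under assumed design choices (sinusoidal per-axis encoding, equal split of the embedding across the x, y, z, t axes); the function names and dimensions are not taken from the LLaVA-4D paper.

```python
# Illustrative sketch only: not the authors' published code.
# Encodes (x, y, z, t) into a "spatiotemporal prompt" added to visual features.
import numpy as np

def sinusoidal_embed(values: np.ndarray, dim: int) -> np.ndarray:
    """Map each scalar coordinate to a dim-sized sin/cos embedding."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = values[..., None] * freqs                     # (..., dim//2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def spatiotemporal_prompt(xyz: np.ndarray, t: np.ndarray, dim: int) -> np.ndarray:
    """Encode 3D position and 1D time into one 4D coordinate embedding.

    xyz: (N, 3) positions, t: (N,) timestamps -> (N, dim) embedding,
    assuming dim is divisible by 8 so the four axes share it evenly.
    """
    per_axis = dim // 4
    parts = [sinusoidal_embed(xyz[:, i], per_axis) for i in range(3)]
    parts.append(sinusoidal_embed(t, per_axis))
    return np.concatenate(parts, axis=-1)                  # (N, dim)

# Example: 4 visual tokens with known 3D positions and timestamps.
feats = np.random.randn(4, 256)           # visual features (N, dim)
xyz = np.random.rand(4, 3)                # normalized 3D positions
t = np.array([0.0, 0.1, 0.2, 0.3])        # normalized times
prompted = feats + spatiotemporal_prompt(xyz, t, 256)
print(prompted.shape)                     # (4, 256)
```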
Related papers
- AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation [57.199352741915625]
In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences. We also contribute the DyMesh dataset, containing over 4M diverse dynamic mesh sequences with text annotations.
arXiv Detail & Related papers (2025-06-11T17:55:16Z) - 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding [83.37551035659119]
There are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects. We introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding.
arXiv Detail & Related papers (2025-03-22T17:55:53Z) - 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models [58.80200897869225]
We propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions. Our results demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
arXiv Detail & Related papers (2025-03-13T14:58:22Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [116.2042238179433]
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems. We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features. This approach can capture relevant information in space and time by fitting the underlying photorealistic spatiotemporal volume. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, novel views for complex dynamic scenes.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - 4-LEGS: 4D Language Embedded Gaussian Splatting [12.699978393733309]
We show how to lift spatiotemporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
arXiv Detail & Related papers (2024-10-14T17:00:53Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting [8.078460597825142]
Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics.
We propose to approximate the underlying spatiotemporal rendering volume of a dynamic scene by optimizing a collection of 4D primitives with explicit geometry and appearance modeling.
Our model is conceptually simple, consisting of a 4D Gaussian parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, as well as view-dependent and time-evolved appearance represented by the coefficients of 4D spherindrical harmonics (a minimal parameterization sketch follows this entry).
arXiv Detail & Related papers (2023-10-16T17:57:43Z)
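For readers unfamiliar with the native 4D Gaussian primitives used in the two Gaussian Splatting entries above, the following sketch lists the kind of parameters one such primitive carries: a space-time mean, anisotropic extent, a rotation, and appearance coefficients. It is an assumed, simplified parameterization for illustration only, not the authors' implementation; the rotation factorization and coefficient count are hypothetical.

```python
# Minimal sketch of what a native 4D Gaussian primitive might carry,
# per the abstracts above. Field names and sizes are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian4D:
    mean: np.ndarray       # (4,)  center in (x, y, z, t)
    scale: np.ndarray      # (4,)  per-axis extent of the anisotropic ellipsoid
    rotation: np.ndarray   # (6,)  parameters of a 4D rotation (a pair of
                           #       quaternions is one common factorization;
                           #       assumed here, not confirmed by the papers)
    sh_coeffs: np.ndarray  # (K, 3) appearance-basis coefficients (the paper
                           #       uses 4D "spherindrical" harmonics)
    opacity: float = 1.0

    def density(self, p: np.ndarray) -> float:
        """Unnormalized Gaussian density at a space-time point p = (x, y, z, t),
        ignoring rotation for brevity (axis-aligned approximation)."""
        d = (p - self.mean) / self.scale
        return self.opacity * float(np.exp(-0.5 * np.dot(d, d)))

g = Gaussian4D(
    mean=np.array([0.0, 0.0, 0.0, 0.5]),
    scale=np.array([0.1, 0.1, 0.1, 0.2]),
    rotation=np.zeros(6),
    sh_coeffs=np.zeros((16, 3)),
)
print(g.density(np.array([0.05, 0.0, 0.0, 0.5])))
```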