Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
- URL: http://arxiv.org/abs/2503.20776v2
- Date: Fri, 28 Mar 2025 04:48:48 GMT
- Title: Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
- Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
- Abstract summary: We introduce Feature4X, a universal framework that extends functionality from 2D vision foundation models into the 4D realm. The framework is the first to distill and lift the features of video foundation models into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
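The core mechanism the abstract describes (attaching a compact latent feature to each Gaussian, splatting those features into 2D feature maps, and supervising them with a frozen video foundation model such as SAM2) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the class name, feature dimension, rasterizer-supplied compositing weights, and cosine-distance loss are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' code) of distilling 2D foundation-model
# features into an explicit Gaussian feature field. Per-pixel compositing
# weights are assumed to come from an off-the-shelf Gaussian rasterizer.

class GaussianFeatureField(torch.nn.Module):
    def __init__(self, num_gaussians: int, feat_dim: int = 32):
        super().__init__()
        # Geometry (means only; scales, rotations, opacities omitted for brevity)
        self.means = torch.nn.Parameter(torch.randn(num_gaussians, 3))
        # One learnable latent feature per Gaussian, distilled from 2D teachers
        self.features = torch.nn.Parameter(torch.zeros(num_gaussians, feat_dim))

    def render_features(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (num_pixels, num_gaussians) alpha-compositing weights from
        # the rasterizer for one frame, reused here to splat latent features
        # instead of colors.
        return weights @ self.features  # (num_pixels, feat_dim)

def distillation_loss(rendered: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # Align rendered features with 2D teacher features (e.g., SAM2 embeddings
    # projected to feat_dim); cosine distance is one common choice.
    return 1.0 - F.cosine_similarity(rendered, teacher, dim=-1).mean()
```

In this sketch, the abstract's "dynamic optimization strategy that unifies multiple model capabilities into a single representation" would correspond to decoding the single latent field into several task-specific heads rather than training one field per task.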
Related papers
- Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
We introduce Uni4D, the first self-disentangled MAE for learning discriminative 4D representations during pre-training.
We demonstrate that it boosts fine-tuning performance across all 4D tasks.
arXiv Detail & Related papers (2025-04-07T08:47:36Z)
- Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization
Video4DGen is a novel framework that excels in generating 4D representations from single or multiple generated videos.
Video4DGen offers a powerful tool for applications in virtual reality, animation, and beyond.
arXiv Detail & Related papers (2025-04-05T12:13:05Z)
- Can Video Diffusion Model Reconstruct 4D Geometry?
Sora3R is a novel framework that taps into the rich spatiotemporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos.
Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z)
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
This paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. We propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes.
arXiv Detail & Related papers (2025-03-19T09:16:08Z)
- 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems. We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features. This approach can capture relevant information in space and time by fitting the underlying photorealistic spatiotemporal volume. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution novel views for complex dynamic scenes. (A schematic sketch of a 4D Gaussian primitive follows this entry.)
arXiv Detail & Related papers (2024-12-30T05:30:26Z)
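As a rough illustration of the native 4D primitive idea above (a hypothetical sketch, not the paper's exact parameterization): a 4D Gaussian assigns an unnormalized density to a space-time point via a 4D mean and covariance.

```python
import numpy as np

# Hypothetical sketch of a native 4D Gaussian primitive: a space-time point
# p = (x, y, z, t) receives unnormalized density from a 4D mean/covariance.
# The parameter layout is illustrative, not the paper's formulation.

def gaussian_4d(p: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    d = p - mean                                   # (4,) space-time offset
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

# Example: a primitive centered at the spatial origin, peaking at t = 0.5
mean = np.array([0.0, 0.0, 0.0, 0.5])
cov = np.diag([0.1, 0.1, 0.1, 0.05])               # axis-aligned for simplicity
print(gaussian_4d(np.array([0.0, 0.0, 0.0, 0.6]), mean, cov))
```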
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. We leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks.
arXiv Detail & Related papers (2024-11-27T18:57:16Z)
- Animate3D: Animating Any 3D Model with Multi-view Video Diffusion
Animate3D is a novel framework for animating any static 3D model.
We introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage multi-view video diffusion priors for animating 3D objects (the generic SDS gradient is sketched after this entry).
arXiv Detail & Related papers (2024-07-16T05:35:57Z)
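For context on the 4D-SDS term above: Score Distillation Sampling optimizes scene parameters \(\theta\) by pushing renderings toward a diffusion prior. The standard SDS gradient is shown below; Animate3D's 4D variant (whose exact form is not given in this summary) extends it with multi-view video diffusion priors.

\[
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
= \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
\]

where \(x = g(\theta)\) is the rendering, \(x_t\) its noised version at diffusion timestep \(t\), \(\hat{\epsilon}_{\phi}\) the frozen diffusion model's noise prediction conditioned on \(y\), and \(w(t)\) a timestep weighting.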
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.