Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
- URL: http://arxiv.org/abs/2502.08377v2
- Date: Wed, 12 Mar 2025 02:49:03 GMT
- Title: Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
- Authors: Liying Yang, Chen Liu, Zhenwei Zhu, Ajian Liu, Hui Ma, Jian Nong, Yanyan Liang
- Abstract summary: We propose a dynamic-static feature decoupling module (DSFD) to enhance dynamic representations. Decoupled features are acquired, driven by dynamic features and current frame features. A temporal-spatial similarity fusion module (TSSF) then adaptively selects similar information of dynamic regions along spatial axes. Our method achieves state-of-the-art (SOTA) results in video-to-4D generation.
- Score: 14.03066701768256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using all the information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to results with blurry textures. We consider that decoupling dynamic-static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along temporal axes, it regards the regions of current frame features that possess significant differences relative to reference frame features as dynamic features. Conversely, the remaining parts are the static features. Then, we acquire decoupled features driven by dynamic features and current frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along spatial axes, it adaptively selects similar information of dynamic regions. Building on the above, we construct a novel approach, DS4D. Experimental results verify that our method achieves state-of-the-art (SOTA) results in video-to-4D generation. In addition, experiments on a real-world scenario dataset demonstrate its effectiveness on 4D scenes. Our code will be publicly available.
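To make the abstract's two ideas concrete, the following is a minimal, illustrative PyTorch sketch of (i) temporal decoupling of dynamic and static features against a reference frame and (ii) spatial similarity-based fusion of decoupled features across viewpoints. The tensor shapes, the threshold `tau`, the hard mask-based split, and the cosine-similarity weighting are all assumptions made for illustration; this is not the authors' DS4D implementation (the paper states the code will be released).

```python
# Illustrative sketch only: mirrors the data flow described in the abstract
# (temporal dynamic-static decoupling, then spatial similarity fusion).
# Shapes, threshold, and fusion rule are assumptions, not the DS4D code.
import torch
import torch.nn.functional as F

def decouple_dynamic_static(curr_feat, ref_feat, tau=0.5):
    """Split current-frame features into dynamic and static parts along the
    temporal axis, using their difference to a reference frame.

    curr_feat, ref_feat: (N, C) per-point (or per-pixel) features.
    tau: assumed threshold on the normalized feature difference (hypothetical).
    """
    diff = (curr_feat - ref_feat).norm(dim=-1)                 # (N,) change magnitude
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    dyn_mask = (diff > tau).float().unsqueeze(-1)              # 1 = dynamic region
    dynamic_feat = dyn_mask * curr_feat
    static_feat = (1.0 - dyn_mask) * curr_feat
    # "Decoupled" features: dynamic features re-injected into the current frame
    # features so dynamic regions are emphasized rather than averaged away.
    decoupled_feat = curr_feat + dynamic_feat
    return decoupled_feat, dynamic_feat, static_feat

def spatial_similarity_fusion(decoupled_feats):
    """Fuse decoupled features from multiple viewpoints by weighting spatially
    similar dynamic information (a stand-in for the TSSF module).

    decoupled_feats: (V, N, C) features from V viewpoints.
    """
    anchor = decoupled_feats.mean(dim=0, keepdim=True)           # (1, N, C)
    sim = F.cosine_similarity(decoupled_feats, anchor, dim=-1)   # (V, N)
    weights = torch.softmax(sim, dim=0).unsqueeze(-1)            # (V, N, 1)
    return (weights * decoupled_feats).sum(dim=0)                # (N, C)

if __name__ == "__main__":
    curr, ref = torch.randn(1024, 64), torch.randn(1024, 64)
    dec, dyn, sta = decouple_dynamic_static(curr, ref)
    fused = spatial_similarity_fusion(torch.stack([dec, dec.roll(1, 0)]))
    print(dec.shape, fused.shape)   # torch.Size([1024, 64]) torch.Size([1024, 64])
```

In the actual DSFD/TSSF modules, how the difference is measured, how dynamic information is re-injected, and how viewpoints are weighted are learned rather than hand-set; the sketch only reproduces the data flow the abstract describes.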
Related papers
- Embracing Dynamics: Dynamics-aware 4D Gaussian Splatting SLAM [0.0]
D4DGS-SLAM is the first SLAM system based on a 4DGS map representation for dynamic environments.
By incorporating the temporal dimension into scene representation, D4DGS-SLAM enables high-quality reconstruction of dynamic scenes.
We show that our method outperforms state-of-the-art approaches in both camera pose tracking and map quality.
arXiv Detail & Related papers (2025-04-07T08:56:35Z) - Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting [16.15871890842964]
We propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance.
Our method consistently outperforms the baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2025-03-25T03:46:13Z) - SDD-4DGS: Static-Dynamic Aware Decoupling in Gaussian Splatting for 4D Scene Reconstruction [21.822062121612166]
This work introduces SDD-4DGS, the first framework for static-dynamic decoupled 4D scene reconstruction based on Gaussian Splatting.
Our approach is built upon a novel probabilistic dynamic perception coefficient that is naturally integrated into the Gaussian reconstruction pipeline.
Experiments on five benchmark datasets demonstrate that SDD-4DGS consistently outperforms state-of-the-art methods in reconstruction fidelity.
arXiv Detail & Related papers (2025-03-12T12:25:58Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [116.2042238179433]
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems. We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features (a minimal data-structure sketch of such primitives appears after this list). This approach can capture relevant information in space and time by fitting the underlying photorealistic spatio-temporal volume. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, novel views for complex dynamic scenes.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos [101.48581851337703]
We present BTimer, the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ('bullet') timestamp by aggregating information from all the context frames. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150 ms while reaching state-of-the-art performance on both static and dynamic scene datasets.
arXiv Detail & Related papers (2024-12-04T18:15:06Z) - Urban4D: Semantic-Guided 4D Gaussian Splatting for Urban Scene Reconstruction [86.4386398262018]
Urban4D introduces a semantic-guided decomposition strategy inspired by advances in deep 2D semantic map generation. Our approach distinguishes potentially dynamic objects through reliable semantic Gaussians. Experiments on real-world datasets demonstrate that Urban4D achieves comparable or better quality than previous state-of-the-art methods.
arXiv Detail & Related papers (2024-12-04T16:59:49Z) - DENSER: 3D Gaussians Splatting for Scene Reconstruction of Dynamic Urban Environments [0.0]
We propose DENSER, a framework that significantly enhances the representation of dynamic objects.
The proposed approach significantly outperforms state-of-the-art methods by a wide margin.
arXiv Detail & Related papers (2024-09-16T07:11:58Z) - Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z) - SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting [7.553079256251747]
We extend 3D Gaussian Splatting to reconstruct dynamic scenes.
We produce high-quality renderings of general dynamic scenes with competitive quantitative performance.
Our method can be viewed in real-time in our dynamic interactive viewer.
arXiv Detail & Related papers (2023-12-20T03:54:03Z) - EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [85.17951804790515]
EmerNeRF is a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes.
It simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping.
Our method achieves state-of-the-art performance in sensor simulation.
arXiv Detail & Related papers (2023-11-03T17:59:55Z) - Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis [58.5779956899918]
We present a method that simultaneously addresses the tasks of dynamic scene novel-view synthesis and six degree-of-freedom (6-DOF) tracking of all dense scene elements.
We follow an analysis-by-synthesis framework, inspired by recent work that models scenes as a collection of 3D Gaussians.
We demonstrate a large number of downstream applications enabled by our representation, including first-person view synthesis, dynamic compositional scene synthesis, and 4D video editing.
arXiv Detail & Related papers (2023-08-18T17:59:21Z) - Efficient 3D Reconstruction, Streaming and Visualization of Static and Dynamic Scene Parts for Multi-client Live-telepresence in Large-scale Environments [6.543101569579952]
We aim at sharing 3D live-telepresence experiences in large-scale environments beyond room scale with both static and dynamic scene entities.
Our system is able to achieve VR-based live-telepresence at close to real-time rates.
arXiv Detail & Related papers (2022-11-25T18:59:54Z) - STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding [68.96574451918458]
We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the Person in Context Challenge.
arXiv Detail & Related papers (2022-07-06T15:48:58Z) - D$^2$NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video [23.905013304668426]
Given a monocular video, segmenting and decoupling dynamic objects while recovering the static environment is a widely studied problem in machine intelligence.
We introduce Decoupled Dynamic Neural Radiance Field (D$^2$NeRF), a self-supervised approach that takes a monocular video and learns a 3D scene representation which decouples moving objects from the static background.
arXiv Detail & Related papers (2022-05-31T14:41:24Z)
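As a companion to the "4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives" entry above, here is a minimal, hypothetical container for native 4D Gaussian primitives. The field names, the log-scale plus double-quaternion covariance parameterization, and the `at_time` weighting are illustrative assumptions, not that paper's exact formulation.

```python
# Illustrative sketch only: a minimal container for native 4D Gaussian primitives.
# Parameterization and field names are assumptions for illustration.
from dataclasses import dataclass
import torch

@dataclass
class Gaussian4D:
    mean_xyzt: torch.Tensor   # (N, 4) center in space-time (x, y, z, t)
    scale: torch.Tensor       # (N, 4) per-axis scales of the 4D covariance (log-space)
    rotation: torch.Tensor    # (N, 8) rotation parameters of the 4D covariance
                              #        (e.g. a pair of quaternions; assumed here)
    opacity: torch.Tensor     # (N, 1) opacity logits
    features: torch.Tensor    # (N, F) appearance features (e.g. SH/color coefficients)

    def at_time(self, t: float) -> torch.Tensor:
        """Very rough temporal slicing: weight each primitive by a 1-D Gaussian in t.

        A real renderer would condition the full 4D covariance on t; this scalar
        weighting only shows how the time axis enters the primitive.
        """
        sigma_t = self.scale[:, 3].exp()                        # temporal std-dev
        dt = self.mean_xyzt[:, 3] - t
        return torch.exp(-0.5 * (dt / (sigma_t + 1e-8)) ** 2)   # (N,) temporal weight

# Usage: 1000 random primitives, queried at t = 0.5
g = Gaussian4D(
    mean_xyzt=torch.randn(1000, 4),
    scale=torch.zeros(1000, 4),
    rotation=torch.randn(1000, 8),
    opacity=torch.zeros(1000, 1),
    features=torch.randn(1000, 48),
)
w = g.at_time(0.5)   # per-primitive temporal weights in (0, 1]
```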