Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
- URL: http://arxiv.org/abs/2512.00677v1
- Date: Sun, 30 Nov 2025 00:18:46 GMT
- Title: Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
- Authors: Dong In Lee, Hyungjun Doh, Seunggeun Chi, Runlin Duan, Sangpil Kim, Karthik Ramani
- Abstract summary: We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. Our method achieves superior editing fidelity as well as better multi-view and temporal consistency than prior approaches.
- Score: 21.55368174087611
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. Its editing mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video editing without additional training and to directly optimize the pre-trained source 4DGS. Extensive experiments on the multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity as well as better multi-view and temporal consistency than prior approaches. Project page for results and code: https://di-lee.github.io/dynamic-eDiTor/
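As a rough illustration of the two mechanisms named above, the sketch below shows (i) joint self-attention inside local (view, time) sub-grids, in the spirit of STGA, and (ii) optical-flow-guided token replacement across windows, in the spirit of CTP. All function names, tensor shapes, and the confidence-thresholding detail are assumptions made for illustration only, not the authors' implementation.

```python
# Minimal, hypothetical sketch of sub-grid attention and flow-guided token
# replacement over multi-view video tokens. Shapes and thresholds are assumed.
import torch
import torch.nn.functional as F

def subgrid_attention(tokens, views_per_grid=2, frames_per_grid=4):
    """Joint self-attention inside local (view, time) sub-grids.

    tokens: (V, T, N, D) -- views, frames, tokens per frame, channels.
    """
    V, T, N, D = tokens.shape
    out = torch.empty_like(tokens)
    for v0 in range(0, V, views_per_grid):
        for t0 in range(0, T, frames_per_grid):
            grid = tokens[v0:v0 + views_per_grid, t0:t0 + frames_per_grid]
            v, t = grid.shape[:2]
            flat = grid.reshape(1, v * t * N, D)              # one joint sequence per sub-grid
            fused = F.scaled_dot_product_attention(flat, flat, flat)
            out[v0:v0 + views_per_grid, t0:t0 + frames_per_grid] = fused.reshape(v, t, N, D)
    return out

def flow_guided_replace(prev_tokens, curr_tokens, flow, conf, thresh=0.5):
    """Inherit flow-warped tokens from the previous window where the flow is reliable.

    prev_tokens, curr_tokens: (H, W, D) token maps; flow: (H, W, 2) backward flow
    in token units; conf: (H, W) flow-confidence map in [0, 1].
    """
    H, W, _ = curr_tokens.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    src_y = (ys + flow[..., 1]).round().long().clamp(0, H - 1)
    src_x = (xs + flow[..., 0]).round().long().clamp(0, W - 1)
    warped = prev_tokens[src_y, src_x]                        # tokens carried along the flow
    keep = (conf > thresh).unsqueeze(-1)                      # inherit only where flow is confident
    return torch.where(keep, warped, curr_tokens)

# Example shapes: 4 views x 8 frames, 16x16 tokens per frame, 64-dim features
x = torch.randn(4, 8, 16 * 16, 64)
fused = subgrid_attention(x)
prev, curr = torch.randn(16, 16, 64), torch.randn(16, 16, 64)
merged = flow_guided_replace(prev, curr, flow=torch.zeros(16, 16, 2), conf=torch.rand(16, 16))
```

In the actual framework these operations presumably act on MM-DiT tokens during editing, and the resulting edited multi-view video is used to directly optimize the pre-trained source 4DGS; the sketch above only illustrates the token-level bookkeeping.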
Related papers
- SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation [30.72482055095692]
SWiT-4D is a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator. It achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision.
arXiv Detail & Related papers (2025-12-11T17:54:31Z) - Efficiently Reconstructing Dynamic Scenes One D4RT at a Time [54.67332582569525]
This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
arXiv Detail & Related papers (2025-12-09T18:57:21Z) - Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image [88.71287865590273]
We introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories. We propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D trajectories. We then propose a 4D View Synthesis Module (4D-Vi) to render videos with arbitrary camera trajectories from 4D point track representations.
arXiv Detail & Related papers (2025-12-04T17:59:10Z) - Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z) - Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation [3.1852855132066673]
Current approaches often struggle to maintain view consistency while handling complex scene dynamics. This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models. It significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) than existing methods.
arXiv Detail & Related papers (2025-08-11T08:55:47Z) - MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image [8.22464804794448]
We propose MVG4D, a novel framework that generates dynamic 4D content from a single still image. At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images. Our method effectively enhances temporal consistency, geometric fidelity, and visual realism, addressing key challenges in motion discontinuity and background degradation.
arXiv Detail & Related papers (2025-07-24T12:48:14Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [60.853577108780414]
Existing 4D generation methods can generate high-quality 4D objects or scenes based on user-friendly conditions.
We propose Trans4D, a novel text-to-4D synthesis framework that enables realistic complex scene transitions.
In experiments, Trans4D consistently outperforms existing state-of-the-art methods in generating 4D scenes with accurate and high-quality transitions.
arXiv Detail & Related papers (2024-10-09T17:56:03Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
We present 4DGen, a novel framework for grounded 4D content creation. Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations. Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals.
arXiv Detail & Related papers (2023-12-28T18:53:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.