Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from
a Single Image
- URL: http://arxiv.org/abs/2308.10257v1
- Date: Sun, 20 Aug 2023 12:53:50 GMT
- Title: Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from
a Single Image
- Authors: Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao,
Guosheng Lin
- Abstract summary: We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
- Score: 59.18564636990079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of synthesizing a long-term dynamic video from only a
single image. This is challenging since it requires consistent visual content
movements given large camera motions. Existing methods either hallucinate
inconsistent perpetual views or struggle with long camera trajectories. To
address these issues, it is essential to estimate the underlying 4D (including
3D geometry and scene motion) and fill in the occluded regions. To this end, we
present Make-It-4D, a novel method that can generate a consistent long-term
dynamic video from a single image. On the one hand, we utilize layered depth
images (LDIs) to represent a scene, and they are then unprojected to form a
feature point cloud. To animate the visual content, the feature point cloud is
displaced based on the scene flow derived from motion estimation and the
corresponding camera pose. Such 4D representation enables our method to
maintain the global consistency of the generated dynamic video. On the other
hand, we fill in the occluded regions by using a pretrained diffusion model to
inpaint and outpaint the input image. This enables our method to work under
large camera motions. Benefiting from our design, our method can be
training-free which saves a significant amount of training time. Experimental
results demonstrate the effectiveness of our approach, which showcases
compelling rendering results.
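As a rough illustration of the 4D representation described in the abstract, the sketch below (a single-layer NumPy toy with hypothetical names, not the authors' implementation) unprojects a depth map into a point cloud, displaces the points by a scene flow field, and reprojects them under a new camera pose:

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) to a 3D point cloud using intrinsics K.

    Each layer of a layered depth image (LDI) can be lifted this way;
    here we sketch a single layer for simplicity."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # back-projected rays (z = 1)
    return rays * depth.reshape(-1, 1)       # scale by depth -> 3D points

def animate_and_render(points, scene_flow, R, t, K):
    """Displace the point cloud by per-point scene flow, then project it
    into a new camera (R, t). Returns pixel coordinates and depths."""
    moved = points + scene_flow              # 4D: geometry plus motion
    cam = moved @ R.T + t                    # world -> new camera frame
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]          # perspective divide
    return uv, cam[:, 2]

# Toy example: a flat plane at depth 2, scene flow pushing every point +0.1 in x.
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
depth = np.full((64, 64), 2.0)
pts = unproject(depth, K)
flow = np.zeros_like(pts)
flow[:, 0] = 0.1
uv, z = animate_and_render(pts, flow, np.eye(3), np.zeros(3), K)
```

With this setup each point stays at depth 2 and shifts by 0.1 * f / z = 5 pixels in x, which is the kind of globally consistent displacement the feature point cloud makes possible. The real method operates on per-point features and fills disocclusions with a diffusion model, which this sketch omits.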
Related papers
- Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z) - SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting [83.5106058182799]
We introduce SEE4D, a pose-free, trajectory-to-camera framework for 4D world modeling from casual videos. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized images. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks.
arXiv Detail & Related papers (2025-10-30T17:59:39Z) - Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video [56.781766315691854]
We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance.
arXiv Detail & Related papers (2025-08-08T21:31:51Z) - AnimateScene: Camera-controllable Animation in Any Scene [34.04222775149215]
3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. One key difficulty lies in placing the human at the correct location and scale within the scene. Another challenge is that the human and the background may exhibit different lighting and styles, leading to unrealistic composites. We present AnimateScene, which addresses these issues in a unified framework.
arXiv Detail & Related papers (2025-08-08T03:28:17Z) - Voyaging into Perpetual Dynamic Scenes from a Single View [31.85867311855001]
A key challenge is to ensure that different generated views are consistent with the underlying 3D motions. We propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting problem with new dynamic content. Experiments show that our model can generate perpetual scenes with consistent motions along fly-through cameras.
arXiv Detail & Related papers (2025-07-05T22:49:25Z) - DreamJourney: Perpetual View Generation with Video Diffusion Models [91.88716097573206]
Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content for previously unseen regions along the camera movement. We present DreamJourney, a two-stage framework that leverages the world-simulation capacity of video diffusion models for perpetual scene view generation.
arXiv Detail & Related papers (2025-06-21T12:51:34Z) - Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images [5.754780404074765]
We propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image.
As far as we know, this is the first attempt that considers animation while representing a complete 3D space from a single landscape image.
arXiv Detail & Related papers (2025-04-04T06:51:39Z) - PaintScene4D: Consistent 4D Scene Generation from Text Prompts [29.075849524496707]
PaintScene4D is a novel text-to-4D scene generation framework.
It harnesses video generative models trained on diverse real-world datasets.
It produces realistic 4D scenes that can be viewed from arbitrary trajectories.
arXiv Detail & Related papers (2024-12-05T18:59:57Z) - GFlow: Recovering 4D World from Monocular Video [58.63051670458107]
We introduce GFlow, a framework that lifts a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time.
GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process.
GFlow transcends the boundaries of mere 4D reconstruction.
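The still/moving clustering step mentioned above can be caricatured as thresholding per-point displacement between time steps (names and threshold are illustrative, not GFlow's actual procedure):

```python
import numpy as np

def split_still_moving(points_t0, points_t1, thresh=0.05):
    """Cluster scene points into still vs. moving parts by displacement
    magnitude between two time steps. A toy stand-in for a still/moving
    clustering step; the threshold is arbitrary."""
    disp = np.linalg.norm(points_t1 - points_t0, axis=1)
    moving = disp > thresh
    return ~moving, moving

p0 = np.array([[0, 0, 1], [1, 0, 2], [0, 1, 3]], dtype=float)
p1 = p0.copy()
p1[1] += [0.5, 0.0, 0.0]                     # only the second point moves
still, moving = split_still_moving(p0, p1)
```

The static cluster can then anchor camera estimation while the moving cluster is optimized per frame, which is the intuition behind a sequential still-then-moving optimization.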
arXiv Detail & Related papers (2024-05-28T17:59:22Z) - Controllable Longer Image Animation with Diffusion Models [12.565739255499594]
We introduce an open-domain controllable image animation method using motion priors with video diffusion models.
Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos.
We propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks.
arXiv Detail & Related papers (2024-05-27T16:08:00Z) - DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action
Segmentation [39.806610397357986]
We present our findings from research conducted on the Human-Object Interaction 4D (HOI4D) dataset for the egocentric action segmentation task.
We convert point cloud videos into depth videos and employ traditional video modeling methods to improve 4D action segmentation.
The proposed method achieved the first place in the 4D Action Track of the HOI4D Challenge 2023.
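The point-cloud-video-to-depth-video conversion mentioned above amounts to rendering each point cloud frame into a depth map; a minimal z-buffer sketch (hypothetical helper, not the DPMix implementation) might look like:

```python
import numpy as np

def pointcloud_to_depth(points, K, H, W):
    """Render a point cloud (N, 3) in camera coordinates into a depth map.

    A z-buffer keeps the nearest point per pixel; empty pixels stay at inf.
    Illustrative only: real pipelines typically splat points or rasterize
    with a renderer rather than looping in Python."""
    depth = np.full((H, W), np.inf)
    front = points[points[:, 2] > 0]                 # keep points in front of the camera
    proj = front @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    z = front[:, 2]
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    for (u, v), d in zip(uv[inb], z[inb]):
        if d < depth[v, u]:                          # z-buffer: nearest point wins
            depth[v, u] = d
    return depth

# Two points competing for the same pixel: the nearer one wins.
K = np.array([[50.0, 0, 16], [0, 50.0, 16], [0, 0, 1]])
pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]])
d = pointcloud_to_depth(pts, K, 32, 32)
```

Applying this per frame turns a point cloud video into a depth video that standard 2D video backbones can consume.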
arXiv Detail & Related papers (2023-07-31T16:14:24Z) - FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses
via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z) - 3D Cinemagraphy from a Single Image [73.09720823592092]
We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography.
Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion.
arXiv Detail & Related papers (2023-03-10T06:08:23Z) - DynIBaR: Neural Dynamic Image-Based Rendering [79.44655794967741]
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene.
We adopt a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
arXiv Detail & Related papers (2022-11-20T20:57:02Z) - NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z) - Neural Radiance Flow for 4D View Synthesis and Video Processing [59.9116932930108]
We present a method to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images.
Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene.
arXiv Detail & Related papers (2020-12-17T17:54:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.