I2V3D: Controllable image-to-video generation with 3D guidance
- URL: http://arxiv.org/abs/2503.09733v1
- Date: Wed, 12 Mar 2025 18:26:34 GMT
- Title: I2V3D: Controllable image-to-video generation with 3D guidance
- Authors: Zhiyuan Zhang, Dongdong Chen, Jing Liao
- Abstract summary: I2V3D is a framework for animating static images into dynamic videos with precise 3D control. Our approach combines the precision of a computer graphics pipeline with advanced generative models.
- Score: 42.23117201457898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.
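The abstract describes a two-stage, 3D-guided generation process. The sketch below illustrates how such a pipeline could be wired together; the function names (`refine_keyframe`, `interpolate_bidirectional`) and the keyframe-stride parameter are hypothetical placeholders for illustration only, since the I2V3D code has not yet been released.

```python
# Minimal sketch of the two-stage generation process described in the
# abstract, assuming placeholder callables for each stage. This is NOT the
# authors' implementation; all names below are illustrative.

from typing import Callable, List, Sequence


def generate_video(
    coarse_renders: Sequence[object],
    refine_keyframe: Callable[[object], object],
    interpolate_bidirectional: Callable[[object, object, int], List[object]],
    keyframe_stride: int = 8,
) -> List[object]:
    """Two-stage 3D-guided generation: refine keyframes, then interpolate."""
    # Stage 1: 3D-Guided Keyframe Generation.
    # A customized image diffusion model refines coarsely rendered keyframes
    # so they stay consistent with the appearance of the input image.
    keyframes = [refine_keyframe(r) for r in coarse_renders[::keyframe_stride]]

    # Stage 2: 3D-Guided Video Interpolation (training-free).
    # Frames between consecutive keyframes are generated with bidirectional
    # guidance from both neighbouring keyframes.
    frames: List[object] = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        frames.append(start)
        frames.extend(interpolate_bidirectional(start, end, keyframe_stride - 1))
    frames.append(keyframes[-1])
    return frames
```

In this reading, the coarse renders come from the graphics pipeline (camera movement, object rotation, character animation), and the generative models supply visual fidelity on top of that geometric scaffold.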
Related papers
- 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models [12.949009540192389]
3DV-TON is a novel framework for generating high-fidelity and temporally consistent video try-on results.
Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance.
To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios.
arXiv Detail & Related papers (2025-04-24T10:12:40Z) - Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [42.581066866708085]
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness.
To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space.
The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
We regularize the shape and motion of objects in the video to eliminate undesired artifacts.
arXiv Detail & Related papers (2025-02-05T21:49:06Z) - F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting [35.625593119642424]
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets.
We propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting.
We also introduce a self-supervised cycle-aggregative constraint to enforce cross-view consistency in the learned 3D representation.
arXiv Detail & Related papers (2025-01-12T04:44:44Z) - LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors [107.83398512719981]
Single-image 3D reconstruction remains a fundamental challenge in computer vision.
Recent advances in Latent Video Diffusion Models offer promising 3D priors learned from large-scale video data.
We propose LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency.
arXiv Detail & Related papers (2024-12-12T18:58:42Z) - DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses [57.17501809717155]
We present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs.
Our key insight is that human images naturally exhibit multiple levels of correlation.
We construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations.
arXiv Detail & Related papers (2024-11-30T08:42:13Z) - StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience.
Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - PV3D: A 3D Generative Model for Portrait Video Generation [94.96025739097922]
We propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos.
PV3D is able to support many downstream applications such as animating static portraits and view-consistent video motion editing.
arXiv Detail & Related papers (2022-12-13T05:42:44Z)