CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- URL: http://arxiv.org/abs/2410.15957v2
- Date: Tue, 22 Oct 2024 06:26:45 GMT
- Title: CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- Authors: Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li
- Abstract summary: In this paper, we emphasize the necessity of integrating explicit physical constraints into model design.
Epipolar attention is proposed to model all cross-frame relationships from a novel perspective of noised conditions, aggregating features along corresponding epipolar lines in all noised frames.
We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images.
- Score: 11.762824216082508
- Abstract: Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion models for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge embedded in camera pose, resulting in imprecise camera control, inconsistencies, and poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. Epipolar attention is proposed for modeling all cross-frame relationships from a novel perspective of noised conditions. This ensures that features are aggregated from corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose multiple guidance scales to allow precise control over the image, text, and camera conditions, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to address the inaccuracy and instability of existing camera-control measurements. We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. Only 24 GB and 12 GB of GPU memory are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation code. Dynamic videos are best viewed at https://zgctroy.github.io/CamI2V.
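As a concrete illustration of the epipolar constraint described in the abstract, the sketch below builds an attention mask that restricts each query pixel to features lying near its epipolar line in another frame. This is a minimal sketch assuming known per-frame intrinsics K and a relative pose (R_rel, t_rel) between frames; the function names and pixel threshold are illustrative assumptions, not the paper's released implementation.

```python
import torch

def skew(t: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    tx, ty, tz = t.tolist()
    return torch.tensor([[0.0, -tz, ty],
                         [tz, 0.0, -tx],
                         [-ty, tx, 0.0]])

def fundamental_matrix(K, R_rel, t_rel):
    """F maps a pixel in the source frame to its epipolar line in the target frame."""
    K_inv = torch.linalg.inv(K)
    return K_inv.T @ skew(t_rel) @ R_rel @ K_inv

def epipolar_mask(K, R_rel, t_rel, H, W, threshold=2.0):
    """Boolean (H*W, H*W) mask over a (low-resolution, latent-grid) frame pair:
    True where a target pixel lies within `threshold` px of the epipolar line
    induced by a source pixel, i.e. where attention is allowed."""
    ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    lines = pix @ fundamental_matrix(K, R_rel, t_rel).T   # (HW, 3): one epipolar line per source pixel
    # point-to-line distance |ax + by + c| / sqrt(a^2 + b^2) for every target pixel
    dist = (lines @ pix.T).abs() / lines[:, :2].norm(dim=-1, keepdim=True).clamp(min=1e-8)
    return dist < threshold
```

Register tokens would then be appended as extra key/value columns that every query may always attend to, covering frames whose epipolar lines yield no valid correspondence. The multiple guidance scales can likewise be sketched as a nested classifier-free-guidance combination; the decomposition below is one common choice and an assumption, not necessarily the paper's exact formula.

```python
def multi_guidance(eps_uncond, eps_img, eps_img_txt, eps_img_txt_cam,
                   s_img=1.5, s_txt=7.5, s_cam=1.0):
    """Combine four denoiser predictions with separate image/text/camera scales."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)              # image guidance
            + s_txt * (eps_img_txt - eps_img)             # text guidance
            + s_cam * (eps_img_txt_cam - eps_img_txt))    # camera guidance
```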
Related papers
- Boosting Camera Motion Control for Video Diffusion Transformers [21.151900688555624]
We show that transformer-based diffusion models (DiT) suffer from severe degradation in camera motion accuracy.
To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), which boosts camera control by over 400%.
Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.
arXiv Detail & Related papers (2024-10-14T17:58:07Z)
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first of its kind to allow the user to specify distinct camera motions while obtaining object motion.
arXiv Detail & Related papers (2024-10-14T17:46:32Z)
- Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras [17.010390107028275]
We propose a precise pose estimation method for dashcam images, leveraging the inherent camera motion prior.
Our method outperforms the baseline by 22% for pose estimation in AUC@5°, and it can estimate poses for 19% more images with lower reprojection error.
arXiv Detail & Related papers (2024-09-27T11:59:00Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plücker coordinates (see the sketch after this list).
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation [117.16677556874278]
We introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation.
To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block.
Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models.
arXiv Detail & Related papers (2024-06-04T17:27:19Z)
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation [86.36135895375425]
Controllability plays a crucial role in video generation since it allows users to create desired content.
Existing models have largely overlooked precise control of camera pose, which serves as a cinematic language.
We introduce CameraCtrl, enabling accurate camera pose control for text-to-video (T2V) models.
arXiv Detail & Related papers (2024-04-02T16:52:41Z)
- VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider a bipartite graph encompassing cameras, dynamically evolving object poses, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z)
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [34.404342332033636]
We introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as the camera's pan and zoom movements.
For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters.
Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios.
arXiv Detail & Related papers (2024-02-05T16:30:57Z)
- Monitoring and Adapting the Physical State of a Camera for Autonomous Vehicles [10.490646039938252]
We propose a generic and task-oriented self-health-maintenance framework for cameras based on data- and physically-grounded models.
We implement the framework on a real-world ground vehicle and demonstrate how a camera can adjust its parameters to counter a poor condition.
Our framework not only provides a practical ready-to-use solution to monitor and maintain the health of cameras, but can also serve as a basis for extensions to tackle more sophisticated problems.
arXiv Detail & Related papers (2021-12-10T11:14:44Z)
- FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction [70.09086274139504]
Multi-view algorithms strongly depend on camera parameters, in particular, the relative positions among the cameras.
We introduce FLEX, an end-to-end parameter-free multi-view model.
We demonstrate results on the Human3.6M and KTH Multi-view Football II datasets.
arXiv Detail & Related papers (2021-05-05T09:08:12Z)
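Referring back to the VD3D entry above: Plücker coordinates are a common way to encode the camera as a per-pixel ray map that a ControlNet-like branch can consume. The sketch below is a hedged illustration under common conventions (camera-to-world extrinsics, pixel-center offsets); the exact normalization and channel order vary by implementation and are assumptions here, not details taken from that paper.

```python
import torch

def plucker_embedding(K, c2w, H, W):
    """Return a (6, H, W) Plücker ray map: (moment, direction) per pixel.
    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world extrinsics."""
    # pixel centers in homogeneous image coordinates
    ys, xs = torch.meshgrid(torch.arange(H) + 0.5, torch.arange(W) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)      # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                        # camera-space ray directions
    dirs = dirs_cam @ c2w[:3, :3].T                               # rotate rays into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                 # unit directions
    origin = c2w[:3, 3].expand_as(dirs)                           # camera center, broadcast per pixel
    moment = torch.cross(origin, dirs, dim=-1)                    # Plücker moment m = o x d
    return torch.cat([moment, dirs], dim=-1).permute(2, 0, 1)     # (6, H, W) conditioning map
```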