CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- URL: http://arxiv.org/abs/2410.15957v2
- Date: Tue, 22 Oct 2024 06:26:45 GMT
- Title: CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- Authors: Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li
- Abstract summary: In this paper, we emphasize the necessity of integrating explicit physical constraints into model design.
Epipolar attention is proposed to model all cross-frame relationships from the novel perspective of a noised condition.
We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images.
- Abstract: Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion models for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge of camera pose, resulting in imprecise camera control, inconsistencies, and poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. We propose epipolar attention to model all cross-frame relationships from the novel perspective of a noised condition. This ensures that features are aggregated from the corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose multiple guidance scales that allow precise control over the image, text, and camera conditions, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to address the inaccuracy and instability of existing camera-control measurements. We achieve a 25.5% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. Only 24 GB and 12 GB of GPU memory are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation code. Dynamic videos are best viewed at https://zgctroy.github.io/CamI2V.
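For intuition, here is a minimal sketch of how an epipolar attention mask could be built from a relative camera pose: attention from a source pixel to another frame is restricted to pixels near its epipolar line, computed via the fundamental matrix F = K^{-T} [t]_x R K^{-1}. All names here (skew, epipolar_attention_mask, threshold, the register-token stand-in) are illustrative assumptions, not the paper's released interface.

```python
import torch

def skew(t):
    """Skew-symmetric matrix [t]_x, so that skew(t) @ v == torch.cross(t, v)."""
    tx = torch.zeros(3, 3, dtype=t.dtype)
    tx[0, 1], tx[0, 2] = -t[2], t[1]
    tx[1, 0], tx[1, 2] = t[2], -t[0]
    tx[2, 0], tx[2, 1] = -t[1], t[0]
    return tx

def epipolar_attention_mask(K, R, t, coords, threshold=2.0):
    """Boolean (N, N+1) mask: entry (i, j) is True when target pixel j lies
    within `threshold` pixels of the epipolar line of source pixel i.

    K: (3, 3) intrinsics; R, t: relative rotation/translation between two
    frames; coords: (N, 3) homogeneous pixel coordinates of the tokens.
    """
    K_inv = torch.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv          # fundamental matrix
    lines = coords @ F.T                       # (N, 3): epipolar line per source pixel
    # point-to-line distance |ax + by + c| / sqrt(a^2 + b^2) for every pair
    dist = (lines @ coords.T).abs() / lines[:, :2].norm(dim=1, keepdim=True).clamp_min(1e-8)
    mask = dist < threshold
    # Per the abstract, learned register tokens handle frame pairs with no
    # epipolar intersection; a simple stand-in is one always-visible column.
    register = torch.ones(mask.shape[0], 1, dtype=torch.bool)
    return torch.cat([mask, register], dim=1)
```

In the attention layer, positions where this mask is False would be set to negative infinity before the softmax, so each query only aggregates features along its epipolar line (or from the register tokens).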
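The "multiple guidance scales" for image, text, and camera can be read as a compositional form of classifier-free guidance. Below is one plausible factorization, in the style of multi-condition guidance; the denoiser signature, the ordering of conditions, and the default scale values are assumptions, not the paper's released scheme.

```python
def multi_scale_guidance(denoiser, x_t, t, img, txt, cam,
                         s_img=1.5, s_txt=7.5, s_cam=1.0):
    """Compositional classifier-free guidance with separate scales for the
    image, text, and camera conditions. `denoiser` is a hypothetical
    noise-prediction callable where None means a dropped (null) condition."""
    e_uncond = denoiser(x_t, t, img=None, txt=None, cam=None)
    e_img    = denoiser(x_t, t, img=img,  txt=None, cam=None)
    e_txt    = denoiser(x_t, t, img=img,  txt=txt,  cam=None)
    e_full   = denoiser(x_t, t, img=img,  txt=txt,  cam=cam)
    # Each scale amplifies only the increment contributed by its own condition.
    return (e_uncond
            + s_img * (e_img - e_uncond)
            + s_txt * (e_txt - e_img)
            + s_cam * (e_full - e_txt))
```

Under this factorization, raising s_cam alone strengthens adherence to the camera trajectory without also amplifying the text or image conditions.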
Related papers
- RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control [10.939379611590333]
RealCam-I2V is a novel diffusion-based video generation framework.
It integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step.
During training, the reconstructed 3D scene enables scaling camera parameters from relative to absolute values.
RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and on out-of-domain images.
arXiv Detail & Related papers (2025-02-14T10:21:49Z) - FlexEvent: Event Camera Object Detection at Arbitrary Frequencies [45.82637829492951]
Event cameras offer unparalleled advantages for real-time perception in dynamic environments.
Existing event-based object detection methods are limited by fixed-frequency paradigms.
We propose FlexEvent, a novel event camera object detection framework that enables detection at arbitrary frequencies.
arXiv Detail & Related papers (2024-12-09T17:57:14Z) - Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training [51.851390459940646]
We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning.
Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution.
Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds.
arXiv Detail & Related papers (2024-12-08T18:59:54Z) - AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation.
We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z) - DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in terms of camera pose estimation even in complex dynamic challenge scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
arXiv Detail & Related papers (2024-09-11T08:36:49Z) - VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z) - E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras, or dynamic vision sensors, capture per-pixel brightness changes (called event streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions that take event streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - Monitoring and Adapting the Physical State of a Camera for Autonomous Vehicles [10.490646039938252]
We propose a generic and task-oriented self-health-maintenance framework for cameras based on data- and physically-grounded models.
We implement the framework on a real-world ground vehicle and demonstrate how a camera can adjust its parameters to counter degraded conditions.
Our framework not only provides a practical ready-to-use solution to monitor and maintain the health of cameras, but can also serve as a basis for extensions to tackle more sophisticated problems.
arXiv Detail & Related papers (2021-12-10T11:14:44Z)