CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- URL: http://arxiv.org/abs/2410.15957v3
- Date: Wed, 04 Dec 2024 12:54:44 GMT
- Title: CamI2V: Camera-Controlled Image-to-Video Diffusion Model
- Authors: Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li
- Abstract summary: Camera pose serves as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control.
We identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability.
We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition.
- Score: 11.762824216082508
- Abstract: Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera movements, dynamic objects, or occlusions, ensuring robust performance in diverse environments. Furthermore, we develop a more robust and reproducible evaluation pipeline to address the inaccuracies and instabilities of existing camera control metrics. Our method achieves a 25.64% improvement in camera controllability on the RealEstate10K dataset without compromising dynamics or generation quality and demonstrates strong generalization to out-of-domain images. Training and inference require only 24GB and 12GB of memory, respectively, for 16-frame sequences at 256x256 resolution. We will release all checkpoints, along with training and evaluation code. Dynamic videos are best viewed at https://zgctroy.github.io/CamI2V.
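As a rough sketch of the epipolar-attention idea described in the abstract (not the authors' released code), the snippet below restricts standard attention so each query pixel only aggregates key-frame features near its epipolar line, with a full-attention fallback for the disappearing-line case; the fundamental matrix `F_mat` and the pixel threshold are assumptions.

```python
import torch
import torch.nn.functional as Fnn

def epipolar_mask(q_xy, k_xy, F_mat, thresh=2.0):
    """Keep only key pixels within `thresh` pixels of each query's epipolar
    line. q_xy: (Nq, 2) and k_xy: (Nk, 2) pixel coordinates, F_mat: (3, 3)
    fundamental matrix mapping query-frame points to key-frame lines."""
    q_h = torch.cat([q_xy, torch.ones(len(q_xy), 1)], dim=1)   # (Nq, 3)
    k_h = torch.cat([k_xy, torch.ones(len(k_xy), 1)], dim=1)   # (Nk, 3)
    lines = q_h @ F_mat.T                                      # (Nq, 3): a, b, c
    # point-to-line distance |ax + by + c| / sqrt(a^2 + b^2)
    dist = (lines @ k_h.T).abs() / lines[:, :2].norm(dim=1, keepdim=True).clamp_min(1e-8)
    return dist < thresh                                       # (Nq, Nk) bool

def epipolar_attention(q, k, v, mask):
    """Dot-product attention restricted by the epipolar mask. Queries whose
    epipolar line misses every key (fast motion, dynamic objects, occlusion)
    fall back to full attention rather than producing NaNs."""
    mask = mask | ~mask.any(dim=1, keepdim=True)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return Fnn.softmax(scores, dim=-1) @ v
```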
Related papers
- RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control [10.939379611590333]
RealCam-I2V is a novel diffusion-based video generation framework.
It integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step.
During training, the reconstructed 3D scene enables scaling camera parameters from relative to absolute values.
RealCam-I2V achieves significant improvements in controllability and video quality on the RealEstate10K dataset and on out-of-domain images.
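A minimal sketch of the relative-to-absolute scaling step described above, assuming a monocular metric-depth map is aligned to the reconstruction by a robust median ratio; function names are illustrative, not RealCam-I2V's API.

```python
import numpy as np

def metric_scale_factor(recon_depth, metric_depth, valid):
    """Estimate one global scale aligning an up-to-scale reconstruction
    with monocular *metric* depth (median ratio is robust to outliers)."""
    ratio = metric_depth[valid] / np.clip(recon_depth[valid], 1e-6, None)
    return float(np.median(ratio))

def rescale_trajectory(poses, scale):
    """Apply the scale to camera translations so pose parameters become
    absolute (metric) rather than relative. poses: (N, 4, 4) extrinsics."""
    poses = poses.copy()
    poses[:, :3, 3] *= scale
    return poses
```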
arXiv Detail & Related papers (2025-02-14T10:21:49Z) - FlexEvent: Event Camera Object Detection at Arbitrary Frequencies [45.82637829492951]
Event cameras offer unparalleled advantages for real-time perception in dynamic environments.
Existing event-based object detection methods are limited by fixed-frequency paradigms.
We propose FlexEvent, a novel event camera object detection framework that enables detection at arbitrary frequencies.
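A small illustration, under assumed inputs, of what arbitrary-frequency operation implies for event data: the stream is sliced into windows at any requested rate, and each window is fed to the detector. This is a generic sketch, not FlexEvent's code.

```python
import numpy as np

def slice_events(ts, hz):
    """Split an event stream into windows at an arbitrary detection
    frequency: one window of events per 1/hz seconds. Assumes the
    timestamp array ts is sorted ascending."""
    period = 1.0 / hz
    edges = np.arange(ts.min(), ts.max() + period, period)
    idx = np.searchsorted(ts, edges)
    return [slice(idx[i], idx[i + 1]) for i in range(len(idx) - 1)]
```

The same model can then run at, say, 10 Hz or 200 Hz simply by changing how the stream is windowed.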
arXiv Detail & Related papers (2024-12-09T17:57:14Z) - Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training [51.851390459940646]
We introduce Latent-Reframe, which enables camera control in a pre-trained video diffusion model without fine-tuning.
Latent-Reframe operates during the sampling stage, maintaining efficiency while preserving the original model distribution.
Our approach reframes the latent code of video frames to align with the input camera trajectory through time-aware point clouds.
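The numpy sketch below loosely illustrates the reframing idea, assuming per-pixel depth for the latent grid and standard pinhole conventions; it is not Latent-Reframe's implementation.

```python
import numpy as np

def reframe_latent(latent, depth, K, src_T_world, dst_T_world):
    """Lift latent features to a point cloud with per-pixel depth, then
    re-render them from a camera on the input trajectory.
    latent: (C, H, W), depth: (H, W), K: (3, 3) intrinsics,
    *_T_world: (4, 4) camera-from-world extrinsics."""
    C, H, W = latent.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # (HW, 3)
    # unproject to source-camera coordinates, then to world
    cam_pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)      # (3, HW)
    world = np.linalg.inv(src_T_world) @ np.vstack(
        [cam_pts, np.ones((1, cam_pts.shape[1]))])                   # (4, HW)
    # project into the target camera
    dst = (dst_T_world @ world)[:3]
    uv = (K @ dst) / np.clip(dst[2:3], 1e-6, None)
    out = np.zeros_like(latent)
    x, y = np.round(uv[0]).astype(int), np.round(uv[1]).astype(int)
    ok = (x >= 0) & (x < W) & (y >= 0) & (y < H) & (dst[2] > 0)
    out[:, y[ok], x[ok]] = latent.reshape(C, -1)[:, ok]  # nearest splat
    return out
```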
arXiv Detail & Related papers (2024-12-08T18:59:54Z) - AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation.
We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z) - DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art camera pose estimation even in complex, challenging dynamic scenes.
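As a hedged illustration of the dynamic-aware pose step, the sketch below drops tracks flagged as moving and solves PnP on the static remainder with OpenCV; the threshold and names are assumptions, not the paper's pipeline.

```python
import cv2
import numpy as np

def pose_from_static_tracks(pts3d, pts2d, dyn_prob, K, dyn_thresh=0.5):
    """Estimate a camera pose from tracked points, discarding tracks a
    dynamic-awareness module flags as moving objects. pts3d: (N, 3),
    pts2d: (N, 2), dyn_prob: (N,) per-track dynamic probability."""
    keep = dyn_prob < dyn_thresh
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d[keep].astype(np.float64),
        pts2d[keep].astype(np.float64),
        K.astype(np.float64), distCoeffs=None)
    R, _ = cv2.Rodrigues(rvec)      # rotation vector -> 3x3 matrix
    return R, tvec, inliers
```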
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - Redundancy-Aware Camera Selection for Indoor Scene Neural Rendering [54.468355408388675]
We build a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images.
We apply a diversity-based sampling algorithm to optimize the camera selection.
We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments.
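A compact sketch of the described selection scheme, assuming camera positions and image feature embeddings are available; the mixing weight and greedy rule are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def build_similarity(cam_pos, img_feats, w_spatial=0.5):
    """Similarity mixing spatial closeness of cameras with semantic
    closeness of their images (cosine on feature embeddings)."""
    d = np.linalg.norm(cam_pos[:, None] - cam_pos[None, :], axis=-1)
    spatial = np.exp(-d / (d.mean() + 1e-8))
    f = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    semantic = f @ f.T
    return w_spatial * spatial + (1 - w_spatial) * semantic

def diverse_select(S, k):
    """Greedy diversity sampling: repeatedly pick the camera least
    similar to everything already selected."""
    chosen = [int(S.sum(axis=1).argmin())]      # start from the outlier
    while len(chosen) < k:
        sim_to_chosen = S[:, chosen].max(axis=1)
        sim_to_chosen[chosen] = np.inf          # never re-pick
        chosen.append(int(sim_to_chosen.argmin()))
    return chosen
```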
arXiv Detail & Related papers (2024-09-11T08:36:49Z) - VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
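A minimal sketch, under assumed SE(3) conventions, of the per-edge error such a bipartite pose graph would minimize; it is illustrative rather than VICAN's custom optimization scheme.

```python
import numpy as np

def relative_residual(T_wc, T_wo, T_co_meas):
    """Edge residual in the bipartite graph: compare the predicted
    camera-object relative pose against the measurement at one time step.
    T_wc: world-from-camera, T_wo: world-from-object, T_co_meas: measured
    camera-from-object transform (all 4x4 SE(3) matrices)."""
    T_co_pred = np.linalg.inv(T_wc) @ T_wo
    E = np.linalg.inv(T_co_meas) @ T_co_pred    # identity if consistent
    rot_err = np.arccos(np.clip((np.trace(E[:3, :3]) - 1) / 2, -1, 1))
    trans_err = np.linalg.norm(E[:3, 3])
    return rot_err, trans_err
```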
arXiv Detail & Related papers (2024-03-25T17:47:03Z) - E2HQV: High-Quality Video Generation from Event Camera via
Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras, or dynamic vision sensors, capture per-pixel brightness changes (event streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions, which take event streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
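As context for E2V inputs (a generic sketch, not E2HQV's model), events are commonly accumulated into a spatio-temporal voxel grid before being fed to a network:

```python
import numpy as np

def events_to_voxel(xs, ys, ts, ps, H, W, n_bins=5):
    """Accumulate an event stream (x, y, timestamp, polarity) into a
    spatio-temporal voxel grid, a common tensor input for E2V networks.
    xs, ys: integer pixel coordinates; ps: polarities in {-1, +1}."""
    vox = np.zeros((n_bins, H, W), dtype=np.float32)
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)  # normalize to [0, 1]
    b = np.clip((t * n_bins).astype(int), 0, n_bins - 1)  # temporal bin index
    np.add.at(vox, (b, ys, xs), np.where(ps > 0, 1.0, -1.0))
    return vox
```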
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - Monitoring and Adapting the Physical State of a Camera for Autonomous
Vehicles [10.490646039938252]
We propose a generic and task-oriented self-health-maintenance framework for cameras based on data- and physically-grounded models.
We implement the framework on a real-world ground vehicle and demonstrate how a camera can adjust its parameters to counter a poor condition.
Our framework not only provides a practical ready-to-use solution to monitor and maintain the health of cameras, but can also serve as a basis for extensions to tackle more sophisticated problems.
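A toy example of the kind of closed-loop parameter adjustment described, assuming exposure time as the controlled parameter; the controller and limits are illustrative, not the paper's framework.

```python
import numpy as np

def adjust_exposure(frame, exposure, target=0.5, gain=0.8,
                    lo=1e-4, hi=1e-1):
    """One step of a simple self-maintenance loop: nudge exposure time so
    mean image brightness tracks a target, clamped to hardware limits."""
    brightness = frame.astype(np.float32).mean() / 255.0
    exposure *= (target / max(brightness, 1e-3)) ** gain
    return float(np.clip(exposure, lo, hi))
```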
arXiv Detail & Related papers (2021-12-10T11:14:44Z)