FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video
Synthesis
- URL: http://arxiv.org/abs/2312.17681v1
- Date: Fri, 29 Dec 2023 16:57:12 GMT
- Title: FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video
Synthesis
- Authors: Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan
Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana
Marculescu
- Abstract summary: Diffusion models have transformed image-to-image (I2I) synthesis and are now permeating into videos.
This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video.
- Score: 66.2611385251157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have transformed image-to-image (I2I) synthesis and are
now permeating into videos. However, the advancement of video-to-video (V2V)
synthesis has been hampered by the challenge of maintaining temporal
consistency across video frames. This paper proposes a consistent V2V synthesis
framework by jointly leveraging spatial conditions and temporal optical flow
clues within the source video. Contrary to prior methods that strictly adhere
to optical flow, our approach harnesses its benefits while handling the
imperfection in flow estimation. We encode the optical flow by warping from
the first frame and use it as a supplementary reference in the diffusion
model. This enables our model to perform video synthesis by editing the first frame
with any prevalent I2I model and then propagating the edits to successive frames.
Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility:
FlowVid works seamlessly with existing I2I models, facilitating various
modifications, including stylization, object swaps, and local edits. (2)
Efficiency: Generating a 4-second video at 30 FPS and 512x512 resolution
takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF,
Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our
FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender
(10.2%), and TokenFlow (40.4%).
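The sketch below is a minimal, hypothetical illustration (not the authors' released code) of the warping step described in the abstract: an edited first frame is backward-warped to a later frame with an estimated optical flow, giving the imperfect, flow-propagated picture that the diffusion model can then use as a supplementary reference rather than copy verbatim. The function name `warp_with_flow` and the flow convention are assumptions made for illustration.

```python
# Minimal, hypothetical sketch (not the FlowVid release): backward-warp an
# edited first frame to a later frame with an estimated optical flow, to
# obtain an imperfect flow-propagated reference for a diffusion model.
import torch
import torch.nn.functional as F

def warp_with_flow(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `image` (B, C, H, W) using `flow` (B, 2, H, W).

    Assumed convention: flow[:, 0] / flow[:, 1] are per-pixel x / y
    displacements (in pixels) mapping each target-frame pixel back to its
    source location in `image`.
    """
    b, _, h, w = image.shape
    # Base sampling grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    coords = grid + flow
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Usage sketch: edit frame 0 with any I2I model, then warp that edit to frame t.
edited_first_frame = torch.rand(1, 3, 512, 512)   # output of an I2I edit
flow_t_to_0 = torch.rand(1, 2, 512, 512)          # estimated flow, frame t -> frame 0
reference_t = warp_with_flow(edited_first_frame, flow_t_to_0)
```

Because the estimated flow contains errors and occlusions, such a warped frame would serve only as a soft, supplementary condition for the denoiser, consistent with the abstract's point about handling imperfect flow.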
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
- Looking Backward: Streaming Video-to-Video Translation with Feature Banks [65.46145157488344]
StreamV2V is a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts.
It can run at 20 FPS on a single A100 GPU, which is 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively.
arXiv Detail & Related papers (2024-05-24T17:53:06Z)
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis [51.44526084095757]
We introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications.
Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames (a brief sketch follows this list).
A comprehensive user study, involving 1000 generated samples, confirms that our approach delivers superior quality, decisively outperforming established methods.
arXiv Detail & Related papers (2023-12-20T01:49:47Z)
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of the MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- MoVideo: Motion-Aware Video Generation with Diffusion Models [97.03352319694795]
We propose a novel motion-aware generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow.
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
arXiv Detail & Related papers (2023-11-19T13:36:03Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation [61.660040308290796]
VideoFlow is a novel optical flow estimation framework for videos.
We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner.
With iterative flow estimation refinement, the information fused in individual TROFs can be propagated into the whole sequence via a motion propagation (MOP) module.
arXiv Detail & Related papers (2023-03-15T03:14:30Z)
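For the anchor-based cross-frame attention mentioned in the Fairy entry above, the following single-head sketch is an illustrative assumption (not Fairy's released implementation): each frame's tokens attend to a shared pool of anchor-frame tokens, so features computed on the anchors implicitly propagate to every other frame. Shapes and names are hypothetical.

```python
# Illustrative single-head sketch of anchor-based cross-frame attention:
# queries come from each frame's tokens, while keys/values come from a shared
# pool of anchor-frame tokens, so anchor features propagate to all frames.
import torch

def cross_frame_attention(frame_feats: torch.Tensor,
                          anchor_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, N, D) tokens per frame; anchor_feats: (A, N, D)."""
    t, n, d = frame_feats.shape
    a = anchor_feats.shape[0]
    # Every frame attends to the same pool of anchor tokens.
    kv = anchor_feats.reshape(1, a * n, d).expand(t, -1, -1)        # (T, A*N, D)
    attn = torch.softmax(frame_feats @ kv.transpose(1, 2) * d ** -0.5, dim=-1)
    return attn @ kv                                                 # (T, N, D)

# Usage sketch: propagate features from two anchor frames to all eight frames.
feats = torch.randn(8, 32 * 32, 64)   # hypothetical per-frame latent tokens
anchors = feats[::4]                   # e.g. frames 0 and 4 as anchors
propagated = cross_frame_attention(feats, anchors)
```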
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.