Consistent Video-to-Video Transfer Using Synthetic Dataset
- URL: http://arxiv.org/abs/2311.00213v3
- Date: Fri, 1 Dec 2023 11:41:34 GMT
- Title: Consistent Video-to-Video Transfer Using Synthetic Dataset
- Authors: Jiaxin Cheng, Tianjun Xiao and Tong He
- Abstract summary: We introduce a novel and efficient approach for text-based video-to-video editing.
At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks.
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain.
- Score: 12.323784941805519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel and efficient approach for text-based video-to-video
editing that eliminates the need for resource-intensive per-video-per-model
finetuning. At the core of our approach is a synthetic paired video dataset
tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's
image transfer via editing instruction, we adapt this paradigm to the video
domain. Extending Prompt-to-Prompt to videos, we efficiently generate
paired samples, each with an input video and its edited counterpart. Alongside
this, we introduce the Long Video Sampling Correction during sampling, ensuring
consistent long videos across batches. Our method surpasses current methods
like Tune-A-Video, heralding substantial progress in text-based video-to-video
editing and suggesting exciting avenues for further exploration and deployment.
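The abstract names Long Video Sampling Correction but gives no mechanism. Below is a minimal NumPy sketch of one plausible reading: each new batch of frames is sampled with its first few frames clamped to the tail of the previous batch, so adjacent batches agree at their seam. `denoise_batch` is a stand-in for the real video diffusion sampler, and the clamp-once scheme (a real sampler would re-anchor at every denoising step) is an assumption for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_batch(frames: np.ndarray) -> np.ndarray:
    """Stand-in for a video diffusion sampling pass (hypothetical)."""
    return frames * 0.9 + rng.normal(scale=0.01, size=frames.shape)

def sample_long_video(n_frames: int, batch: int = 16, overlap: int = 4,
                      shape=(64, 64, 3)) -> np.ndarray:
    """Sample a long video batch by batch, clamping the first `overlap`
    frames of each batch to the tail of the previous batch so the seams
    stay consistent across batches."""
    video, prev_tail = [], None
    while len(video) < n_frames:
        frames = rng.normal(size=(batch,) + shape)
        if prev_tail is not None:
            frames[:overlap] = prev_tail           # anchor the seam
        frames = denoise_batch(frames)
        if prev_tail is not None:
            frames[:overlap] = prev_tail           # re-clamp after denoising
            video.extend(frames[overlap:])
        else:
            video.extend(frames)
        prev_tail = frames[-overlap:].copy()
    return np.stack(video[:n_frames])

print(sample_long_video(40).shape)  # (40, 64, 64, 3)
```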
Related papers
- RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
It supports multiple video editing capabilities such as removal, addition, and modification through a unified pipeline.
The proposed framework demonstrates impressive capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
- FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [10.011515580084243]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs)
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z)
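The FastVideoEdit summary describes a direct source-to-target mapping via consistency models with a special variance schedule, but no implementation detail. The sketch below shows the generic consistency-model editing pattern it alludes to: partially noise the source frames, then jump back to the data manifold in one step conditioned on the target prompt. `f_theta` is a hypothetical pretrained consistency function; a placeholder stands in for it here.
```python
import torch

def consistency_edit(src_frames, f_theta, text_emb, sigma=0.5):
    """Zero-shot edit sketch: perturb the source with noise at level
    `sigma`, then map back in one step conditioned on the target prompt.
    `sigma` trades source fidelity against edit strength (assumption)."""
    noise = torch.randn_like(src_frames)
    x_sigma = src_frames + sigma * noise       # partially noise the source
    return f_theta(x_sigma, sigma, text_emb)   # one-step jump to the edit

# Toy usage with a dummy "model"; not a real consistency model.
frames = torch.zeros(8, 3, 64, 64)             # (T, C, H, W) source clip
f = lambda x, s, t: x / (1 + s**2) ** 0.5      # placeholder denoiser
edited = consistency_edit(frames, f, text_emb=None)
print(edited.shape)                            # torch.Size([8, 3, 64, 64])
```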
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
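A minimal sketch of the frozen-backbone pattern the VideoPrism summary describes: one fixed encoder feeds lightweight per-task heads, so many tasks share a single model. The encoder, head names, and all sizes below are toy stand-ins, not the actual architecture.
```python
import torch
import torch.nn as nn

# Stand-in per-frame encoder; the real backbone is a video transformer.
encoder = nn.Linear(64 * 64 * 3, 256)
for p in encoder.parameters():
    p.requires_grad = False                    # the backbone stays frozen

# Each downstream task trains only a light head on frozen features.
heads = nn.ModuleDict({
    "video_qa": nn.Linear(256, 1000),          # answer vocabulary (toy size)
    "classification": nn.Linear(256, 400),     # e.g. action classes (toy)
})

clip = torch.rand(2, 16, 64 * 64 * 3)          # (batch, frames, pixels)
feats = encoder(clip).mean(dim=1)              # pool frozen frame features
logits = heads["classification"](feats)
print(logits.shape)                            # torch.Size([2, 400])
```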
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models [52.512109160994655]
We present Control-A-Video, a controllable text-to-video (T2V) diffusion model that maintains consistency while supporting customizable video synthesis.
For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation.
Our model achieves resource-efficient convergence and generates consistent, coherent videos with fine-grained control.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
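The Control-A-Video summary mentions motion priors and content priors without specifics. The sketch below shows one common way such priors are injected at sampling time: the first frame's latent serves as a content prior, and frame-to-frame residuals of a source clip serve as a motion prior when initializing the latents. All names and shapes are illustrative assumptions, not the paper's exact scheme.
```python
import torch

def init_latents_with_priors(first_frame_latent, src_latents):
    """Initialize T2V latents from two priors: reuse the first frame's
    latent (content prior) and propagate the source clip's frame
    residuals (motion prior), leaving noise headroom for synthesis."""
    T = src_latents.shape[0]
    motion = src_latents[1:] - src_latents[:-1]     # motion prior: residuals
    latents = [first_frame_latent]                  # content prior: frame 0
    for t in range(1, T):
        step = latents[-1] + motion[t - 1]          # carry the motion over
        step = step + 0.1 * torch.randn_like(step)  # room for new content
        latents.append(step)
    return torch.stack(latents)

src = torch.randn(8, 4, 32, 32)                     # (T, C, h, w) toy latents
out = init_latents_with_priors(src[0], src)
print(out.shape)                                    # torch.Size([8, 4, 32, 32])
```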
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the first work on automatic video transition recommendation (VTR).
Given a sequence of raw video shots and companion audio, VTR recommends a video transition for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)
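To make the "multi-modal matching framework" concrete, here is a hedged sketch: one tower embeds each neighboring shot pair plus audio, a learned table embeds the transition effects, and the recommendation is the closest transition in the shared space. The `ContextTower` module and all dimensions are illustrative assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

class ContextTower(nn.Module):
    """Embeds a (shot pair, audio) context into the matching space."""
    def __init__(self, vid_dim=512, aud_dim=128, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * vid_dim + aud_dim, out_dim)

    def forward(self, shot_a, shot_b, audio):
        return self.proj(torch.cat([shot_a, shot_b, audio], dim=-1))

n_transitions = 30                             # effect library size (toy)
transition_emb = nn.Embedding(n_transitions, 256)
tower = ContextTower()

shot_a, shot_b = torch.rand(4, 512), torch.rand(4, 512)
audio = torch.rand(4, 128)
ctx = tower(shot_a, shot_b, audio)             # (4, 256) context embeddings
scores = ctx @ transition_emb.weight.T         # similarity to every effect
print(scores.argmax(dim=-1))                   # recommended effect per cut
```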
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
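The SwinBERT summary credits densely sampled frames plus sparse attention. The sketch below illustrates the general idea of a learnable attention mask regularized toward sparsity, so dense video tokens avoid paying for full attention everywhere; the mask parameterization and L1 penalty are assumptions for illustration, not the paper's exact design.
```python
import torch
import torch.nn as nn

n_tokens, dim = 256, 64
mask_logits = nn.Parameter(torch.zeros(n_tokens, n_tokens))
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

tokens = torch.rand(2, n_tokens, dim)          # (batch, video tokens, dim)
soft_mask = torch.sigmoid(mask_logits)         # learned 0..1 per token pair
bias = torch.log(soft_mask + 1e-6)             # additive attention bias
out, _ = attn(tokens, tokens, tokens, attn_mask=bias)

sparsity_loss = soft_mask.abs().mean()         # L1 pushes the mask sparse
print(out.shape, float(sparsity_loss))
```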