Consistent Video-to-Video Transfer Using Synthetic Dataset
- URL: http://arxiv.org/abs/2311.00213v3
- Date: Fri, 1 Dec 2023 11:41:34 GMT
- Title: Consistent Video-to-Video Transfer Using Synthetic Dataset
- Authors: Jiaxin Cheng, Tianjun Xiao and Tong He
- Abstract summary: We introduce a novel and efficient approach for text-based video-to-video editing.
At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks.
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain.
- Score: 12.323784941805519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel and efficient approach for text-based video-to-video
editing that eliminates the need for resource-intensive per-video-per-model
finetuning. At the core of our approach is a synthetic paired video dataset
tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's
image transfer via editing instruction, we adapt this paradigm to the video
domain. Extending the Prompt-to-Prompt to videos, we efficiently generate
paired samples, each with an input video and its edited counterpart. Alongside
this, we introduce the Long Video Sampling Correction during sampling, ensuring
consistent long videos across batches. Our method surpasses current methods
like Tune-A-Video, heralding substantial progress in text-based video-to-video
editing and suggesting exciting avenues for further exploration and deployment.
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [8.907836546058086]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs)
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z) - VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the premier work on performing automatic video transitions recommendation (VTR)
VTR is given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z) - SwinBERT: End-to-End Transformers with Sparse Attention for Video
Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.