Consistent Video-to-Video Transfer Using Synthetic Dataset
- URL: http://arxiv.org/abs/2311.00213v3
- Date: Fri, 1 Dec 2023 11:41:34 GMT
- Title: Consistent Video-to-Video Transfer Using Synthetic Dataset
- Authors: Jiaxin Cheng, Tianjun Xiao and Tong He
- Abstract summary: We introduce a novel and efficient approach for text-based video-to-video editing.
At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks.
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain.
- Score: 12.323784941805519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel and efficient approach for text-based video-to-video
editing that eliminates the need for resource-intensive per-video-per-model
finetuning. At the core of our approach is a synthetic paired video dataset
tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's
image transfer via editing instruction, we adapt this paradigm to the video
domain. Extending the Prompt-to-Prompt to videos, we efficiently generate
paired samples, each with an input video and its edited counterpart. Alongside
this, we introduce the Long Video Sampling Correction during sampling, ensuring
consistent long videos across batches. Our method surpasses current methods
like Tune-A-Video, heralding substantial progress in text-based video-to-video
editing and suggesting exciting avenues for further exploration and deployment.
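To make the Long Video Sampling Correction idea concrete, below is a minimal sketch of batch-wise sampling with an overlap anchor so consecutive batches of a long video stay consistent. The denoiser, window size, overlap length, and frame-pinning rule here are hypothetical placeholders for illustration, not the authors' actual implementation.
```python
from typing import Optional
import numpy as np

def denoise_batch(noisy_latents: np.ndarray, anchor: Optional[np.ndarray]) -> np.ndarray:
    """Hypothetical stand-in for the video diffusion denoiser.

    The real method would run iterative diffusion sampling here; this toy
    version just damps the noise and, when an anchor is given, pulls the
    batch toward the previous batch's trailing frames so the example runs.
    """
    out = noisy_latents * 0.05
    if anchor is not None:
        out = out + 0.95 * anchor.mean(axis=0)  # broadcast over the batch dimension
    return out

def sample_long_video(num_frames: int,
                      batch_size: int = 16,
                      overlap: int = 4,
                      latent_shape: tuple = (4, 32, 32),
                      seed: int = 0) -> np.ndarray:
    """Sample a long video batch by batch, reusing the last `overlap` frames
    of each batch as an anchor for the next so neighboring batches agree."""
    rng = np.random.default_rng(seed)
    frames: list = []
    anchor: Optional[np.ndarray] = None
    while len(frames) < num_frames:
        noise = rng.standard_normal((batch_size, *latent_shape))
        batch = denoise_batch(noise, anchor)
        if anchor is not None:
            # Correction step: pin the overlapping frames to what was already
            # committed, so the new batch cannot drift away from the old one.
            batch[:overlap] = anchor
            frames.extend(batch[overlap:])
        else:
            frames.extend(batch)
        anchor = batch[-overlap:].copy()
    return np.stack(frames[:num_frames])

if __name__ == "__main__":
    video_latents = sample_long_video(num_frames=48)
    print(video_latents.shape)  # (48, 4, 32, 32)
```
The key design choice illustrated is that each batch is not sampled independently: a fixed number of frames is carried over from the previous batch and held fixed, which is one simple way to keep content from drifting across batch boundaries.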
Related papers
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.
A deep-compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios.
Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
arXiv Detail & Related papers (2025-02-14T15:58:10Z) - Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs [6.300563383392837]
The exponential growth of short-video content has created a pressing need for efficient, automated video editing solutions.
We propose an end-to-end foundational framework that achieves precise control over the final edited video content.
arXiv Detail & Related papers (2025-01-10T11:35:43Z) - Text-Video Multi-Grained Integration for Video Moment Montage [13.794791614348084]
A new task called Video Moment Montage (VMM) aims to accurately locate the corresponding video segments based on a pre-provided narration text.
We present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features.
arXiv Detail & Related papers (2024-12-12T13:40:59Z) - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768x1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [8.907836546058086]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs).
Our method enables direct mapping from the source video to the target video with strong content preservation, using a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the premier work on automatic video transition recommendation (VTR).
Given a sequence of raw video shots and companion audio, VTR recommends a video transition for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)