Consistent Video-to-Video Transfer Using Synthetic Dataset
- URL: http://arxiv.org/abs/2311.00213v3
- Date: Fri, 1 Dec 2023 11:41:34 GMT
- Title: Consistent Video-to-Video Transfer Using Synthetic Dataset
- Authors: Jiaxin Cheng, Tianjun Xiao and Tong He
- Abstract summary: We introduce a novel and efficient approach for text-based video-to-video editing.
At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks.
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain.
- Score: 12.323784941805519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel and efficient approach for text-based video-to-video
editing that eliminates the need for resource-intensive per-video-per-model
finetuning. At the core of our approach is a synthetic paired video dataset
tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's
image transfer via editing instruction, we adapt this paradigm to the video
domain. Extending Prompt-to-Prompt to videos, we efficiently generate
paired samples, each with an input video and its edited counterpart. Alongside
this, we introduce the Long Video Sampling Correction during sampling, ensuring
consistent long videos across batches. Our method surpasses current methods
like Tune-A-Video, heralding substantial progress in text-based video-to-video
editing and suggesting exciting avenues for further exploration and deployment.
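The abstract names Long Video Sampling Correction but gives no mechanism. Below is a minimal NumPy sketch of one plausible reading: each new batch of frames is sampled with its first few frames clamped to the tail of the previous batch, so adjacent batches agree at their seam. `denoise_batch` is a stand-in for the real video diffusion sampler, and the clamp-once scheme (a real sampler would re-anchor at every denoising step) is an assumption for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_batch(frames: np.ndarray) -> np.ndarray:
    """Stand-in for a video diffusion sampling pass (hypothetical)."""
    return frames * 0.9 + rng.normal(scale=0.01, size=frames.shape)

def sample_long_video(n_frames: int, batch: int = 16, overlap: int = 4,
                      shape=(64, 64, 3)) -> np.ndarray:
    """Sample a long video batch by batch, clamping the first `overlap`
    frames of each batch to the tail of the previous batch so the seams
    stay consistent across batches."""
    video, prev_tail = [], None
    while len(video) < n_frames:
        frames = rng.normal(size=(batch,) + shape)
        if prev_tail is not None:
            frames[:overlap] = prev_tail           # anchor the seam
        frames = denoise_batch(frames)
        if prev_tail is not None:
            frames[:overlap] = prev_tail           # re-clamp after denoising
            video.extend(frames[overlap:])
        else:
            video.extend(frames)
        prev_tail = frames[-overlap:].copy()
    return np.stack(video[:n_frames])

print(sample_long_video(40).shape)  # (40, 64, 64, 3)
```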
Related papers
- RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
It supports multiple video editing capabilities such as removal, addition, and modification through a unified pipeline.
The proposed framework demonstrates impressive capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
- FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing [10.011515580084243]
Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion.
We propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs)
Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule.
arXiv Detail & Related papers (2024-03-10T17:12:01Z)
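The FastVideoEdit summary describes a direct source-to-target mapping via consistency models with a special variance schedule, but no implementation detail. The sketch below shows the generic consistency-model editing pattern it alludes to: partially noise the source frames, then jump back to the data manifold in one step conditioned on the target prompt. `f_theta` is a hypothetical pretrained consistency function; a placeholder stands in for it here.
```python
import torch

def consistency_edit(src_frames, f_theta, text_emb, sigma=0.5):
    """Zero-shot edit sketch: perturb the source with noise at level
    `sigma`, then map back in one step conditioned on the target prompt.
    `sigma` trades source fidelity against edit strength (assumption)."""
    noise = torch.randn_like(src_frames)
    x_sigma = src_frames + sigma * noise       # partially noise the source
    return f_theta(x_sigma, sigma, text_emb)   # one-step jump to the edit

# Toy usage with a dummy "model"; not a real consistency model.
frames = torch.zeros(8, 3, 64, 64)             # (T, C, H, W) source clip
f = lambda x, s, t: x / (1 + s**2) ** 0.5      # placeholder denoiser
edited = consistency_edit(frames, f, text_emb=None)
print(edited.shape)                            # torch.Size([8, 3, 64, 64])
```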
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
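A minimal sketch of the frozen-backbone pattern the VideoPrism summary describes: one fixed encoder feeds lightweight per-task heads, so many tasks share a single model. The encoder, head names, and all sizes below are toy stand-ins, not the actual architecture.
```python
import torch
import torch.nn as nn

# Stand-in per-frame encoder; the real backbone is a video transformer.
encoder = nn.Linear(64 * 64 * 3, 256)
for p in encoder.parameters():
    p.requires_grad = False                    # the backbone stays frozen

# Each downstream task trains only a light head on frozen features.
heads = nn.ModuleDict({
    "video_qa": nn.Linear(256, 1000),          # answer vocabulary (toy size)
    "classification": nn.Linear(256, 400),     # e.g. action classes (toy)
})

clip = torch.rand(2, 16, 64 * 64 * 3)          # (batch, frames, pixels)
feats = encoder(clip).mean(dim=1)              # pool frozen frame features
logits = heads["classification"](feats)
print(logits.shape)                            # torch.Size([2, 400])
```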
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models [52.512109160994655]
We present Control-A-Video, a controllable text-to-video (T2V) diffusion model that maintains consistency while supporting customizable video synthesis.
For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation.
Our model achieves resource-efficient convergence and generates consistent, coherent videos with fine-grained control.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
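The Control-A-Video summary mentions motion priors and content priors without specifics. The sketch below shows one common way such priors are injected at sampling time: the first frame's latent serves as a content prior, and frame-to-frame residuals of a source clip serve as a motion prior when initializing the latents. All names and shapes are illustrative assumptions, not the paper's exact scheme.
```python
import torch

def init_latents_with_priors(first_frame_latent, src_latents):
    """Initialize T2V latents from two priors: reuse the first frame's
    latent (content prior) and propagate the source clip's frame
    residuals (motion prior), leaving noise headroom for synthesis."""
    T = src_latents.shape[0]
    motion = src_latents[1:] - src_latents[:-1]     # motion prior: residuals
    latents = [first_frame_latent]                  # content prior: frame 0
    for t in range(1, T):
        step = latents[-1] + motion[t - 1]          # carry the motion over
        step = step + 0.1 * torch.randn_like(step)  # room for new content
        latents.append(step)
    return torch.stack(latents)

src = torch.randn(8, 4, 32, 32)                     # (T, C, h, w) toy latents
out = init_latents_with_priors(src[0], src)
print(out.shape)                                    # torch.Size([8, 4, 32, 32])
```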
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the first work on automatic video transition recommendation (VTR).
Given a sequence of raw video shots and companion audio, VTR recommends a video transition for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)
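To make the "multi-modal matching framework" concrete, here is a hedged sketch: one tower embeds each neighboring shot pair plus audio, a learned table embeds the transition effects, and the recommendation is the closest transition in the shared space. The `ContextTower` module and all dimensions are illustrative assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

class ContextTower(nn.Module):
    """Embeds a (shot pair, audio) context into the matching space."""
    def __init__(self, vid_dim=512, aud_dim=128, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * vid_dim + aud_dim, out_dim)

    def forward(self, shot_a, shot_b, audio):
        return self.proj(torch.cat([shot_a, shot_b, audio], dim=-1))

n_transitions = 30                             # effect library size (toy)
transition_emb = nn.Embedding(n_transitions, 256)
tower = ContextTower()

shot_a, shot_b = torch.rand(4, 512), torch.rand(4, 512)
audio = torch.rand(4, 128)
ctx = tower(shot_a, shot_b, audio)             # (4, 256) context embeddings
scores = ctx @ transition_emb.weight.T         # similarity to every effect
print(scores.argmax(dim=-1))                   # recommended effect per cut
```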
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
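The SwinBERT summary credits densely sampled frames plus sparse attention. The sketch below illustrates the general idea of a learnable attention mask regularized toward sparsity, so dense video tokens avoid paying for full attention everywhere; the mask parameterization and L1 penalty are assumptions for illustration, not the paper's exact design.
```python
import torch
import torch.nn as nn

n_tokens, dim = 256, 64
mask_logits = nn.Parameter(torch.zeros(n_tokens, n_tokens))
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

tokens = torch.rand(2, n_tokens, dim)          # (batch, video tokens, dim)
soft_mask = torch.sigmoid(mask_logits)         # learned 0..1 per token pair
bias = torch.log(soft_mask + 1e-6)             # additive attention bias
out, _ = attn(tokens, tokens, tokens, attn_mask=bias)

sparsity_loss = soft_mask.abs().mean()         # L1 pushes the mask sparse
print(out.shape, float(sparsity_loss))
```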