AutoTransition: Learning to Recommend Video Transition Effects
- URL: http://arxiv.org/abs/2207.13479v1
- Date: Wed, 27 Jul 2022 12:00:42 GMT
- Title: AutoTransition: Learning to Recommend Video Transition Effects
- Authors: Yaojie Shen, Libo Zhang, Kai Xu, Xiaojie Jin
- Abstract summary: We present the premier work on automatic video transitions recommendation (VTR):
given a sequence of raw video shots and companion audio, recommend a video transition for each pair of neighboring shots.
We propose a novel multi-modal matching framework which consists of two parts.
- Score: 20.384463765702417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video transition effects are widely used in video editing to connect shots
for creating cohesive and visually appealing videos. However, it is challenging
for non-professionals to choose the best transitions due to the lack of
cinematographic knowledge and design skills. In this paper, we present the
premier work on performing automatic video transitions recommendation (VTR):
given a sequence of raw video shots and companion audio, recommend video
transitions for each pair of neighboring shots. To solve this task, we collect
a large-scale video transition dataset using publicly available video templates
in editing software. Then we formulate VTR as a multi-modal retrieval problem
from vision/audio to video transitions and propose a novel multi-modal matching
framework which consists of two parts. First we learn the embedding of video
transitions through a video transition classification task. Then we propose a
model to learn the matching correspondence from vision/audio inputs to video
transitions. Specifically, the proposed model employs a multi-modal transformer
to fuse vision and audio information, as well as capture the context cues in
sequential transition outputs. Through both quantitative and qualitative
experiments, we clearly demonstrate the effectiveness of our method. Notably,
in the comprehensive user study, our method receives scores comparable to those
of professional editors while improving video editing efficiency by 300×. We
hope our work serves to inspire other researchers to work on this new task. The
dataset and code are public at https://github.com/acherstyx/AutoTransition.
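As a concrete illustration of the two-part framework described above, below is a minimal sketch, assuming a PyTorch implementation in which a transformer encoder fuses per-shot vision and audio features and transitions are ranked by cosine similarity against a bank of transition embeddings (which the paper obtains from a classification-pretrained encoder). The module names, feature dimensions, and similarity-based ranking are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the two-stage matching idea (not the released implementation):
# (1) transition effects are embedded by a separately trained classifier/encoder,
# (2) a multi-modal transformer fuses per-shot vision/audio features and the fused
#     embeddings are matched against the transition embeddings by similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalMatcher(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Project per-shot visual and audio features into a shared space.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        # Transformer encoder fuses the modalities and captures context
        # across the sequence of shot boundaries.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim), aud_feats: (B, T, aud_dim)
        # T = number of shot boundaries (pairs of neighboring shots).
        x = self.vis_proj(vis_feats) + self.aud_proj(aud_feats)
        x = self.encoder(x)                       # (B, T, d_model)
        return F.normalize(self.out_proj(x), dim=-1)


def rank_transitions(query_emb, transition_embs, top_k=5):
    """Retrieve transition effects by cosine similarity.

    query_emb:       (B, T, D) fused vision/audio embeddings
    transition_embs: (N, D) embeddings of N transition effects
    """
    transition_embs = F.normalize(transition_embs, dim=-1)
    scores = query_emb @ transition_embs.t()      # (B, T, N)
    return scores.topk(top_k, dim=-1).indices     # recommended transition ids


# Toy usage with random tensors standing in for real shot/audio encoders.
matcher = MultiModalMatcher()
vis = torch.randn(2, 6, 512)          # 2 videos, 6 shot boundaries each
aud = torch.randn(2, 6, 128)
bank = torch.randn(30, 256)           # embeddings for 30 transition effects
print(rank_transitions(matcher(vis, aud), bank).shape)  # torch.Size([2, 6, 5])
```

In such a setup, the top-k indices for each shot boundary would be mapped back to concrete transition effects in the transition vocabulary, and training would pull the fused embedding toward the embedding of the ground-truth transition.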
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion
Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z) - Consistent Video-to-Video Transfer Using Synthetic Dataset [12.323784941805519]
We introduce a novel and efficient approach for text-based video-to-video editing.
At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks.
Inspired by Instruct Pix2Pix's image transfer via editing instruction, we adapt this paradigm to the video domain.
arXiv Detail & Related papers (2023-11-01T01:20:12Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - The Anatomy of Video Editing: A Dataset and Benchmark Suite for
AI-Assisted Video Editing [90.59584961661345]
This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing.
Our benchmark suite focuses on video editing tasks, beyond visual effects, such as automatic footage organization and assisted video assembling.
To enable research on these fronts, we annotate more than 1.5M tags, covering concepts relevant to cinematography, from 196,176 shots sampled from movie scenes.
arXiv Detail & Related papers (2022-07-20T10:53:48Z)