VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
- URL: http://arxiv.org/abs/2311.18837v1
- Date: Thu, 30 Nov 2023 18:59:52 GMT
- Title: VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
- Authors: Zhen Xing and Qi Dai and Zihao Zhang and Hui Zhang and Han Hu and Zuxuan Wu and Yu-Gang Jiang
- Abstract summary: Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate videos into the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
- Score: 96.55004961251889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have achieved significant success in image and video
generation. This motivates a growing interest in video editing tasks, where
videos are edited according to provided text descriptions. However, most
existing approaches only focus on video editing for short clips and rely on
time-consuming tuning or inference. We are the first to propose Video
Instruction Diffusion (VIDiff), a unified foundation model designed for a wide
range of video tasks. These tasks encompass both understanding tasks (such as
language-guided video object segmentation) and generative tasks (video editing
and enhancement). Our model can edit and translate videos into the desired
results within seconds based on user instructions. Moreover, we design an iterative
auto-regressive method to ensure consistency in editing and enhancing long
videos. We provide convincing generative results for diverse input videos and
written instructions, both qualitatively and quantitatively. More examples can
be found at our website https://ChenHsing.github.io/VIDiff.
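To make the iterative auto-regressive idea in the abstract concrete, here is a minimal Python sketch of long-video editing in overlapping chunks, where each chunk is conditioned on the last edited frames of the previous one. This is an illustration under assumed interfaces, not the authors' code: edit_chunk is a hypothetical stand-in for the instruction-conditioned diffusion editor.

from typing import Optional
import numpy as np

def edit_chunk(frames: np.ndarray, instruction: str,
               cond_frames: Optional[np.ndarray] = None) -> np.ndarray:
    """Hypothetical editor for one chunk; a real implementation would run
    the instruction-conditioned diffusion model here."""
    # Placeholder identity "edit" so the sketch runs end to end.
    return frames.copy()

def edit_long_video(video: np.ndarray, instruction: str,
                    chunk_len: int = 16, overlap: int = 4) -> np.ndarray:
    """Edit a long video auto-regressively in overlapping chunks."""
    edited, cond, start = [], None, 0
    while start < len(video):
        out = edit_chunk(video[start:start + chunk_len], instruction,
                         cond_frames=cond)
        # Keep only frames not already produced by the previous chunk.
        edited.append(out if start == 0 else out[overlap:])
        # The tail of the edited chunk conditions the next chunk.
        cond = out[-overlap:]
        if start + chunk_len >= len(video):
            break
        start += chunk_len - overlap
    return np.concatenate(edited, axis=0)

# Toy usage: a dummy 50-frame RGB video at 64x64 resolution.
video = np.random.rand(50, 64, 64, 3).astype(np.float32)
result = edit_long_video(video, "make it look like a watercolor painting")
assert result.shape == video.shape

The overlap frames carry appearance and motion from one chunk to the next, which is what keeps an edit of a long video temporally consistent in this kind of auto-regressive scheme.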
Related papers
- Step Differences in Instructional Video [34.551572600535565]
We propose an approach that generates visual instruction tuning data involving pairs of videos from HowTo100M.
We then train a video-conditioned language model to jointly reason across multiple raw videos.
Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos.
arXiv Detail & Related papers (2024-04-24T21:49:59Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- EffiVED: Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z)
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models [19.792535444735957]
RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training.
It produces high-quality videos while preserving original motion and semantic structure.
RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations.
arXiv Detail & Related papers (2023-12-07T18:43:45Z)
- AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID.
Our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting.
Our experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z)
- Learning to Cut by Watching Movies [114.57935905189416]
This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility.
Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that trigger cuts.
We devise a model that learns to discriminate between real and artificial cuts via contrastive learning, as sketched below.
arXiv Detail & Related papers (2021-08-09T18:37:17Z)
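As an illustration of the contrastive objective in that last entry, here is a minimal sketch (under assumed interfaces, not the paper's code) that pushes the score of a real cut above the scores of artificially constructed cuts with an InfoNCE-style loss; the audiovisual scoring network itself is treated as a black box that has already produced per-cut scores.

import torch
import torch.nn.functional as F

def contrastive_cut_loss(real_score: torch.Tensor,
                         fake_scores: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """real_score: (B,) scores of real cuts; fake_scores: (B, N) scores of
    artificial cuts sampled around the same clip pair."""
    logits = torch.cat([real_score.unsqueeze(1), fake_scores], dim=1) / temperature
    # The real cut sits at index 0 of every row, so it is the "class" to pick.
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

# Toy usage with random scores: a batch of 8 cuts, 15 artificial negatives each.
real = torch.randn(8, requires_grad=True)
fake = torch.randn(8, 15, requires_grad=True)
loss = contrastive_cut_loss(real, fake)
loss.backward()

In this sketch, ranking cut plausibility then amounts to sorting candidate cut points by the learned score.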
This list is automatically generated from the titles and abstracts of the papers in this site.