Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning
- URL: http://arxiv.org/abs/2004.06502v1
- Date: Tue, 14 Apr 2020 13:44:30 GMT
- Title: Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning
- Authors: Kangning Liu, Shuhang Gu, Andres Romero, Radu Timofte
- Abstract summary: We propose a novel unsupervised video-to-video translation model.
Our model decomposes the style and the content using a specialized encoder-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
- Score: 92.17835753226333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing unsupervised video-to-video translation methods fail to produce
translated videos that are frame-wise realistic, semantic-information
preserving, and video-level consistent. In this work, we propose UVIT, a novel
unsupervised video-to-video translation model. Our model decomposes the style
and the content, uses the specialized encoder-decoder structure and propagates
the inter-frame information through bidirectional recurrent neural network
(RNN) units. The style-content decomposition mechanism enables us to achieve
style-consistent video translation results as well as provides us with a good
interface for modality-flexible translation. In addition, by changing the input
frames and style codes incorporated in our translation, we propose a video
interpolation loss, which captures temporal information within the sequence to
train our building blocks in a self-supervised manner. Our model can produce
photo-realistic, spatio-temporally consistent translated videos in a multimodal
way. Subjective and objective experimental results validate the superiority of
our model over existing methods. More details can be found on our project
website: https://uvit.netlify.com
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and other video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z) - Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.