Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning
- URL: http://arxiv.org/abs/2203.02573v1
- Date: Fri, 4 Mar 2022 21:09:13 GMT
- Title: Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning
- Authors: Ligong Han and Jian Ren and Hsin-Ying Lee and Francesco Barbieri and
Kyle Olszewski and Shervin Minaee and Dimitris Metaxas and Sergey Tulyakov
- Abstract summary: This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
- Score: 36.85533835408882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most methods for conditional video synthesis use a single modality as the
condition. This comes with major limitations. For example, it is problematic
for a model conditioned on an image to generate a specific motion trajectory
desired by the user since there is no means to provide motion information.
Conversely, language information can describe the desired motion, while not
precisely defining the content of the video. This work presents a multimodal
video generation framework that benefits from text and images provided jointly
or separately. We leverage the recent progress in quantized representations for
videos and apply a bidirectional transformer with multiple modalities as inputs
to predict a discrete video representation. To improve video quality and
consistency, we propose a new video token trained with self-learning and an
improved mask-prediction algorithm for sampling video tokens. We introduce text
augmentation to improve the robustness of the textual representation and
diversity of generated videos. Our framework can incorporate various visual
modalities, such as segmentation masks, drawings, and partially occluded
images. It can generate much longer sequences than the one used for training.
In addition, our model can extract visual information as suggested by the text
prompt, e.g., "an object in image one is moving northeast", and generate
corresponding videos. We run evaluations on three public datasets and a newly
collected dataset labeled with facial attributes, achieving state-of-the-art
generation results on all four.
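To make the sampling step concrete, below is a minimal, self-contained sketch of iterative mask-prediction sampling of quantized video tokens with a bidirectional transformer conditioned on text/image tokens, in the spirit of the approach the abstract describes. The toy model, codebook size, token-grid shape, and cosine unmasking schedule are illustrative assumptions, not the paper's actual architecture, self-learned video token, or released code.

```python
# Minimal sketch of iterative mask-prediction sampling for discrete video tokens,
# conditioned on text/image tokens. Shapes, the toy model, and the cosine schedule
# are illustrative assumptions, not the paper's released implementation.
import math
import torch
import torch.nn as nn

VIDEO_VOCAB = 1024      # size of the quantized video codebook (assumed)
MASK_ID = VIDEO_VOCAB   # extra index reserved for the [MASK] token
SEQ_LEN = 16 * 8 * 8    # frames x height x width of the token grid (assumed)


class ToyBidirectionalTransformer(nn.Module):
    """Stand-in for a multimodal bidirectional transformer: it attends over
    concatenated condition tokens and (partially masked) video tokens and
    predicts a distribution over the video codebook at every video position."""

    def __init__(self, cond_vocab=512, dim=256, depth=4, heads=8):
        super().__init__()
        self.cond_emb = nn.Embedding(cond_vocab, dim)
        self.video_emb = nn.Embedding(VIDEO_VOCAB + 1, dim)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, VIDEO_VOCAB)

    def forward(self, cond_tokens, video_tokens):
        x = torch.cat([self.cond_emb(cond_tokens), self.video_emb(video_tokens)], dim=1)
        h = self.encoder(x)                                   # bidirectional (no causal mask)
        return self.to_logits(h[:, cond_tokens.shape[1]:])    # logits for video slots only


@torch.no_grad()
def mask_predict_sample(model, cond_tokens, steps=8):
    """Start from an all-[MASK] video sequence and iteratively re-predict the
    least confident positions, keeping more tokens fixed at every step."""
    b = cond_tokens.shape[0]
    video = torch.full((b, SEQ_LEN), MASK_ID, dtype=torch.long)
    for t in range(steps):
        logits = model(cond_tokens, video)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)
        # Cosine schedule: the fraction of positions left masked shrinks each step.
        frac_masked = math.cos(math.pi / 2 * (t + 1) / steps)
        num_keep = SEQ_LEN - int(frac_masked * SEQ_LEN)
        keep = conf.topk(num_keep, dim=-1).indices            # most confident slots
        video = torch.full_like(video, MASK_ID)
        video.scatter_(1, keep, pred.gather(1, keep))         # freeze confident tokens
    # Fill any still-masked slots with the final prediction.
    logits = model(cond_tokens, video)
    final = logits.argmax(-1)
    return torch.where(video == MASK_ID, final, video)


if __name__ == "__main__":
    model = ToyBidirectionalTransformer()
    text_and_image_tokens = torch.randint(0, 512, (1, 96))    # placeholder conditions
    video_tokens = mask_predict_sample(model, text_and_image_tokens)
    print(video_tokens.shape)  # (1, SEQ_LEN); decode with a video VQ decoder in practice
```

At each step the most confident predictions are kept fixed while the remaining positions are re-masked and re-predicted, so a bidirectional model can fill in the whole token grid in a handful of passes instead of decoding it autoregressively, token by token.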
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieve state-of-the-art results in video recognition, video-speech tasks, and other video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that produces a video depicting multiple events, given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without any fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - Phenaki: Variable Length Video Generation From Open Domain Textual
Description [21.610541668826006]
Phenaki is a model capable of realistic video synthesis given a sequence of textual prompts.
A new model for learning video representations compresses the video into a small set of discrete tokens.
To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.
arXiv Detail & Related papers (2022-10-05T17:18:28Z) - Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video with respect to its frames.
The MFST framework first extracts features for each modality (visual, text, audio) using pretrained encoders.
MFST then trains the multimodal frame-scoring transformer, which takes the video-text-audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content (a generic sketch of this pattern appears after this list).
arXiv Detail & Related papers (2022-05-22T05:18:27Z) - End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
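Several entries above, most directly the VidIL summary, describe a pipeline that first turns video frames into text with an image-language model and then prompts a text-only language model with a few in-context examples. The sketch below illustrates only that generic pattern; `caption_frame` and `llm_complete` are hypothetical placeholders, and the prompt format is an assumption rather than any paper's actual prompt.

```python
# Rough sketch of the few-shot "video frames -> captions -> language model" pattern
# described in the VidIL entry above. `caption_frame` and `llm_complete` are
# hypothetical placeholders for an image-language model and a text-only language
# model; the real systems, prompts, and phrase extractors are not specified here.
from typing import Callable, List, Tuple


def build_prompt(frame_captions: List[str],
                 examples: List[Tuple[str, str]],
                 instruction: str) -> str:
    """Compose a few-shot prompt: instruction, in-context examples, then the
    frame-level captions of the query video listed in temporal order."""
    lines = [instruction, ""]
    for context, target in examples:                 # in-context demonstrations
        lines += [f"Frames: {context}", f"Output: {target}", ""]
    indexed = "; ".join(f"frame {i + 1}: {c}" for i, c in enumerate(frame_captions))
    lines += [f"Frames: {indexed}", "Output:"]
    return "\n".join(lines)


def video_to_text(frames: List[object],
                  caption_frame: Callable[[object], str],
                  llm_complete: Callable[[str], str],
                  examples: List[Tuple[str, str]],
                  instruction: str = "Summarize the video in one sentence.") -> str:
    """Translate a video into frame captions, then let a language model compose
    the target output from that textual representation."""
    captions = [caption_frame(f) for f in frames]    # image-language model per frame
    prompt = build_prompt(captions, examples, instruction)
    return llm_complete(prompt)                      # text-only LLM does the reasoning


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model downloads.
    fake_captioner = lambda frame: f"a person {frame}"
    fake_llm = lambda prompt: prompt.splitlines()[-2].removeprefix("Frames: ")
    demo = [("frame 1: a dog runs; frame 2: a dog jumps", "A dog runs and jumps.")]
    print(video_to_text(["opens a door", "walks outside"],
                        fake_captioner, fake_llm, demo))
```

The appeal of this pattern, as the VidIL summary presents it, is that all video-specific perception happens in the captioning step, so the language model only ever sees text and can be prompted for new target tasks without retraining.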
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all automatically generated summaries) and is not responsible for any consequences arising from its use.