InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- URL: http://arxiv.org/abs/2402.03040v1
- Date: Mon, 5 Feb 2024 14:24:46 GMT
- Title: InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- Authors: Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao,
Xiangyu Yue
- Abstract summary: $\textit{InteractiveVideo}$ is a user-centric framework for video generation.
We propose a Synergistic Multimodal Instruction mechanism to seamlessly integrate users' multimodal instructions into generative models.
With $\textit{InteractiveVideo}$, users are given the flexibility to meticulously tailor key aspects of a video.
- Score: 23.536645072596656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce $\textit{InteractiveVideo}$, a user-centric framework for video
generation. Different from traditional generative approaches that operate based
on user-provided images or text, our framework is designed for dynamic
interaction, allowing users to instruct the generative model through various
intuitive mechanisms during the whole generation process, e.g. text and image
prompts, painting, drag-and-drop, etc. We propose a Synergistic Multimodal
Instruction mechanism, designed to seamlessly integrate users' multimodal
instructions into generative models, thus facilitating a cooperative and
responsive interaction between user inputs and the generative process. This
approach enables iterative and fine-grained refinement of the generation result
through precise and effective user instructions. With
$\textit{InteractiveVideo}$, users are given the flexibility to meticulously
tailor key aspects of a video. They can paint the reference image, edit
semantics, and adjust video motions until their requirements are fully met.
Code, models, and demo are available at
https://github.com/invictus717/InteractiveVideo
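The abstract describes an iterative loop in which users keep folding new multimodal instructions (text prompts, painted regions, drag-and-drop motion edits) into an ongoing generation rather than restarting from scratch. The sketch below illustrates that accumulation pattern only; the class and method names are hypothetical stand-ins, not the actual InteractiveVideo API.

```python
# Hypothetical sketch of the user-in-the-loop refinement flow described
# in the abstract. Names below are illustrative, not the real API.
from dataclasses import dataclass, field


@dataclass
class GenerationState:
    """Accumulates multimodal instructions across refinement rounds."""
    text_prompt: str = ""
    image_edits: list = field(default_factory=list)   # e.g. painted regions
    motion_edits: list = field(default_factory=list)  # e.g. drag trajectories

    def apply(self, instruction: dict) -> "GenerationState":
        # Each round merges a new instruction into the running state --
        # the core of the cooperative, interactive generation idea.
        if "text" in instruction:
            self.text_prompt = instruction["text"]
        if "paint" in instruction:
            self.image_edits.append(instruction["paint"])
        if "drag" in instruction:
            self.motion_edits.append(instruction["drag"])
        return self


def refine(state: GenerationState, instructions: list) -> GenerationState:
    """Iteratively fold a sequence of user instructions into the state."""
    for ins in instructions:
        state = state.apply(ins)
    return state


state = refine(GenerationState(), [
    {"text": "a cat on a skateboard"},
    {"paint": "recolor background to sunset"},
    {"drag": "move cat 40px right across frames"},
])
print(state.text_prompt)        # latest text instruction
print(len(state.motion_edits))  # accumulated motion edits
```

In a real system, `apply` would condition the generative model on the merged instructions each round; here it only records them to show how the three instruction channels compose.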
Related papers
- Explore Synergistic Interaction Across Frames for Interactive Video
Object Segmentation [70.93295323156876]
We propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF)
Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60)
Our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.
arXiv Detail & Related papers (2024-01-23T04:19:15Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Interactive Text Generation [75.23894005664533]
We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users.
We train our interactive models using Imitation Learning, and our experiments against competitive non-interactive generation models show that models trained interactively are superior to their non-interactive counterparts.
arXiv Detail & Related papers (2023-03-02T01:57:17Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z) - A Framework for Integrating Gesture Generation Models into Interactive
Conversational Agents [0.0]
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users.
Recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users.
We present a proof-of-concept framework which is intended to facilitate evaluation of modern gesture generation models in interaction.
arXiv Detail & Related papers (2021-02-24T14:31:21Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.