InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- URL: http://arxiv.org/abs/2402.03040v1
- Date: Mon, 5 Feb 2024 14:24:46 GMT
- Title: InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- Authors: Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao,
Xiangyu Yue
- Abstract summary: $\textit{InteractiveVideo}$ is a user-centric framework for video generation.
We propose a Synergistic Multimodal Instruction mechanism to seamlessly integrate users' multimodal instructions into generative models.
With $\textit{InteractiveVideo}$, users are given the flexibility to meticulously tailor key aspects of a video.
- Score: 23.536645072596656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce $\textit{InteractiveVideo}$, a user-centric framework for video
generation. Different from traditional generative approaches that operate based
on user-provided images or text, our framework is designed for dynamic
interaction, allowing users to instruct the generative model through various
intuitive mechanisms during the whole generation process, e.g. text and image
prompts, painting, drag-and-drop, etc. We propose a Synergistic Multimodal
Instruction mechanism, designed to seamlessly integrate users' multimodal
instructions into generative models, thus facilitating a cooperative and
responsive interaction between user inputs and the generative process. This
approach enables iterative and fine-grained refinement of the generation result
through precise and effective user instructions. With
$\textit{InteractiveVideo}$, users are given the flexibility to meticulously
tailor key aspects of a video. They can paint the reference image, edit
semantics, and adjust video motions until their requirements are fully met.
Code, models, and demo are available at
https://github.com/invictus717/InteractiveVideo
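The abstract describes an iterative loop in which users keep folding new multimodal instructions (text prompts, painted regions, drag-and-drop motion edits) into an ongoing generation rather than restarting from scratch. The sketch below illustrates that accumulation pattern only; the class and method names are hypothetical stand-ins, not the actual InteractiveVideo API.

```python
# Hypothetical sketch of the user-in-the-loop refinement flow described
# in the abstract. Names below are illustrative, not the real API.
from dataclasses import dataclass, field


@dataclass
class GenerationState:
    """Accumulates multimodal instructions across refinement rounds."""
    text_prompt: str = ""
    image_edits: list = field(default_factory=list)   # e.g. painted regions
    motion_edits: list = field(default_factory=list)  # e.g. drag trajectories

    def apply(self, instruction: dict) -> "GenerationState":
        # Each round merges a new instruction into the running state --
        # the core of the cooperative, interactive generation idea.
        if "text" in instruction:
            self.text_prompt = instruction["text"]
        if "paint" in instruction:
            self.image_edits.append(instruction["paint"])
        if "drag" in instruction:
            self.motion_edits.append(instruction["drag"])
        return self


def refine(state: GenerationState, instructions: list) -> GenerationState:
    """Iteratively fold a sequence of user instructions into the state."""
    for ins in instructions:
        state = state.apply(ins)
    return state


state = refine(GenerationState(), [
    {"text": "a cat on a skateboard"},
    {"paint": "recolor background to sunset"},
    {"drag": "move cat 40px right across frames"},
])
print(state.text_prompt)        # latest text instruction
print(len(state.motion_edits))  # accumulated motion edits
```

In a real system, `apply` would condition the generative model on the merged instructions each round; here it only records them to show how the three instruction channels compose.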
Related papers
- Explore Synergistic Interaction Across Frames for Interactive Video
Object Segmentation [70.93295323156876]
We propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF)
Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60)
Our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.
arXiv Detail & Related papers (2024-01-23T04:19:15Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Interactive Text Generation [75.23894005664533]
We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users.
We train our interactive models using Imitation Learning, and our experiments against competitive non-interactive generation models show that models trained interactively are superior to their non-interactive counterparts.
arXiv Detail & Related papers (2023-03-02T01:57:17Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z) - A Framework for Integrating Gesture Generation Models into Interactive
Conversational Agents [0.0]
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users.
Recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users.
We present a proof-of-concept framework which is intended to facilitate evaluation of modern gesture generation models in interaction.
arXiv Detail & Related papers (2021-02-24T14:31:21Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.