InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- URL: http://arxiv.org/abs/2402.03040v1
- Date: Mon, 5 Feb 2024 14:24:46 GMT
- Title: InteractiveVideo: User-Centric Controllable Video Generation with
Synergistic Multimodal Instructions
- Authors: Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao,
Xiangyu Yue
- Abstract summary: $\textit{InteractiveVideo}$ is a user-centric framework for video generation.
We propose a Synergistic Multimodal Instruction mechanism to seamlessly integrate users' multimodal instructions into generative models.
With $\textit{InteractiveVideo}$, users are given the flexibility to meticulously tailor key aspects of a video.
- Score: 23.536645072596656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce $\textit{InteractiveVideo}$, a user-centric framework for video
generation. Different from traditional generative approaches that operate based
on user-provided images or text, our framework is designed for dynamic
interaction, allowing users to instruct the generative model through various
intuitive mechanisms during the whole generation process, e.g. text and image
prompts, painting, drag-and-drop, etc. We propose a Synergistic Multimodal
Instruction mechanism, designed to seamlessly integrate users' multimodal
instructions into generative models, thus facilitating a cooperative and
responsive interaction between user inputs and the generative process. This
approach enables iterative and fine-grained refinement of the generation result
through precise and effective user instructions. With
$\textit{InteractiveVideo}$, users are given the flexibility to meticulously
tailor key aspects of a video. They can paint the reference image, edit
semantics, and adjust video motions until their requirements are fully met.
Code, models, and demo are available at
https://github.com/invictus717/InteractiveVideo
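The iterative, user-in-the-loop refinement cycle the abstract describes can be sketched as a minimal Python simulation. All class and method names below are hypothetical illustrations of the interaction pattern (accumulating text, painting, and drag instructions, then regenerating), not the actual InteractiveVideo API.

```python
from dataclasses import dataclass, field

@dataclass
class Instruction:
    """One user instruction: a text prompt, a painted region, or a drag edit."""
    modality: str  # e.g. "text", "paint", "drag"
    payload: str

@dataclass
class InteractiveSession:
    """Hypothetical sketch of the cooperative generation loop:
    the user keeps adding multimodal instructions and regenerating
    until the result meets their requirements."""
    instructions: list = field(default_factory=list)

    def add_instruction(self, modality: str, payload: str) -> None:
        self.instructions.append(Instruction(modality, payload))

    def generate(self) -> str:
        # A real system would condition a video generative model on all
        # accumulated multimodal instructions; here we simply merge them
        # into a description of the conditioning state.
        return " + ".join(f"{i.modality}:{i.payload}" for i in self.instructions)

# The user iteratively refines the result across modalities.
session = InteractiveSession()
session.add_instruction("text", "a cat on a sofa")
session.add_instruction("paint", "red collar on the cat")
session.add_instruction("drag", "move cat toward window")
print(session.generate())
```

The key design point this mirrors is that instructions from different modalities are integrated into a single conditioning state rather than handled as separate one-shot prompts, which is what enables fine-grained, cumulative refinement.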
Related papers
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video-text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
- MotionBridge: Dynamic Video Inbetweening with Flexible Controls [29.029643539300434]
We introduce MotionBridge, a unified video inbetweening framework.
It allows flexible controls, including trajectory strokes, video editing masks, guide pixels, and text.
We show that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
arXiv Detail & Related papers (2024-12-17T18:59:33Z)
- Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction.
By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process.
Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Interactive Text Generation [75.23894005664533]
We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users.
We train our interactive models using Imitation Learning; experiments against competitive non-interactive generation models show that interactively trained models are superior to their non-interactive counterparts.
arXiv Detail & Related papers (2023-03-02T01:57:17Z)
- A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents [0.0]
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users.
Recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users.
We present a proof-of-concept framework which is intended to facilitate evaluation of modern gesture generation models in interaction.
arXiv Detail & Related papers (2021-02-24T14:31:21Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.