AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition
- URL: http://arxiv.org/abs/2408.11564v1
- Date: Wed, 21 Aug 2024 12:18:22 GMT
- Title: AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition
- Authors: Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, Wangmeng Zuo, Nan Duan
- Abstract summary: AutoDirector is an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing.
It improves the efficiency of multi-sensory film production through automatic scheduling and supports interactive modification and refinement of tasks to meet user needs.
- Score: 149.89952404881174
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism. However, approaches for generating multi-sensory outputs have not been fully explored, limiting their application in high-value scenarios such as directing a film. Developing a movie-director agent faces two major challenges: (1) Lack of parallelism and online scheduling across production steps: in the production of multi-sensory films, there are complex dependencies between different sensory elements, and the production time for each element varies. (2) Diverse user needs and the demand for clear communication: users often cannot clearly express their needs until they see a draft, so human-computer interaction and iteration are required to continually adjust and optimize the film content based on user feedback. To address these issues, we introduce AutoDirector, an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing. The framework improves the efficiency of multi-sensory film production through automatic scheduling and supports interactive modification and refinement of tasks to meet user needs. AutoDirector not only expands the scope of human-machine collaboration but also demonstrates the potential of AI to collaborate with humans in the role of a film director to complete multi-sensory films.
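The first challenge described above, scheduling interdependent production steps that take different amounts of time, can be pictured as launching each step as soon as its prerequisites finish while independent steps run in parallel. The sketch below illustrates only that general idea; the task graph, durations, and scheduling policy are illustrative assumptions and are not taken from the paper.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical production steps: name -> (dependencies, simulated duration in seconds).
TASKS = {
    "script":    (set(),                            0.2),
    "visuals":   ({"script"},                       0.6),
    "music":     ({"script"},                       0.4),
    "dubbing":   ({"script"},                       0.3),
    "lip_sync":  ({"visuals", "dubbing"},           0.3),
    "final_mix": ({"visuals", "music", "lip_sync"}, 0.2),
}

def produce(name, duration):
    """Stand-in for a generative model call (e.g. video, music, or speech synthesis)."""
    time.sleep(duration)
    return name

def schedule(tasks):
    """Launch each step as soon as its dependencies are done, so independent
    steps (e.g. music and visuals) run in parallel."""
    done, running = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # Submit every step whose dependencies are satisfied and that is not yet running.
            for name, (deps, duration) in tasks.items():
                if name not in done and name not in running and deps <= done:
                    running[name] = pool.submit(produce, name, duration)
            # Block until at least one running step finishes, then mark it done.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for future in finished:
                name = future.result()
                done.add(name)
                del running[name]
                print(f"finished: {name}")

if __name__ == "__main__":
    schedule(TASKS)
```

Running the sketch prints the steps in dependency order, with independent steps such as music, visuals, and dubbing overlapping in time rather than executing sequentially.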
Related papers
- Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration [64.6107798750142]
Vocal Sandbox is a framework for enabling seamless human-robot collaboration in situated environments.
We design lightweight and interpretable learning algorithms that allow users to build an understanding of, and co-adapt to, a robot's capabilities in real time.
We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation.
arXiv Detail & Related papers (2024-11-04T20:44:40Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Towards Embedding Dynamic Personas in Interactive Robots: Masquerading Animated Social Kinematics (MASK) [10.351714893090964]
This paper presents the design and development of an innovative interactive robotic system to enhance audience engagement using character-like personas.
Built upon the foundations of persona-driven dialog agents, this work extends the agent's application to the physical realm, employing robots to provide a more captivating and interactive experience.
arXiv Detail & Related papers (2024-03-15T06:22:32Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- GTAutoAct: An Automatic Datasets Generation Framework Based on Game Engine Redevelopment for Action Recognition [12.521014978532548]
GTAutoAct is a novel dataset generation framework leveraging game engine technology to facilitate advancements in action recognition.
It transforms coordinate-based 3D human motion into a rotation-oriented representation that is better suited to multiple viewpoints.
It implements an autonomous video capture and processing pipeline, featuring a randomly navigating camera, with auto-trimming and labeling functionalities.
arXiv Detail & Related papers (2024-01-24T12:18:31Z)
- ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions [66.87211993793807]
We present ReMoS, a denoising-diffusion-based model that synthesizes the full-body motion of a person in a two-person interaction scenario.
We demonstrate ReMoS across challenging two-person scenarios such as pair dancing, Ninjutsu, kickboxing, and acrobatics.
We also contribute the ReMoCap dataset for two-person interactions, containing full-body and finger motions.
arXiv Detail & Related papers (2023-11-28T18:59:52Z)
- InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions [49.097973114627344]
We present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process.
We first contribute a multimodal dataset, named InterHuman. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 23,337 natural language descriptions.
We propose a novel representation for motion input in our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame.
arXiv Detail & Related papers (2023-04-12T08:12:29Z)
- Smart Director: An Event-Driven Directing System for Live Broadcasting [110.30675947733167]
Smart Director aims at mimicking the typical human-in-the-loop broadcasting process to automatically create near-professional broadcasting programs in real-time.
Our system is the first end-to-end automated directing system for multi-camera sports broadcasting.
arXiv Detail & Related papers (2022-01-11T16:14:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.