MultiModal Action Conditioned Video Generation
- URL: http://arxiv.org/abs/2510.02287v1
- Date: Thu, 02 Oct 2025 17:57:06 GMT
- Title: MultiModal Action Conditioned Video Generation
- Authors: Yichen Li, Antonio Torralba
- Abstract summary: General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. We consider the senses of proprioception, kinesthesia, force haptics, and muscle activation. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift.
- Score: 30.4362754101642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider the senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance the causality of action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
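The abstract names two training objectives but describes them only at a high level: cross-modal alignment that preserves modality-unique information, and a regularization scheme encouraging causal action-trajectory features. The PyTorch sketch below is one hypothetical instantiation, not the paper's released code: the shared/private encoder split, the InfoNCE alignment, the reconstruction term, and the next-step predictive regularizer are all assumptions.

```python
# Hypothetical sketch of the abstract's two objectives; every design
# choice here is an assumption, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Encodes one action modality (e.g., haptics or muscle activation)
    into a shared embedding (aligned across modalities) and a private
    embedding (preserving modality-unique information)."""
    def __init__(self, in_dim: int, shared_dim: int = 128, private_dim: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, shared_dim))
        self.private = nn.Sequential(nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, private_dim))
        self.decoder = nn.Linear(shared_dim + private_dim, in_dim)  # reconstruction head

    def forward(self, x):
        s, p = self.shared(x), self.private(x)
        recon = self.decoder(torch.cat([s, p], dim=-1))
        return s, p, recon

def alignment_loss(s_a, s_b, temperature: float = 0.07):
    """Symmetric InfoNCE over timesteps: matching timesteps of two
    modalities are positives, all other timesteps are negatives."""
    s_a = F.normalize(s_a, dim=-1)
    s_b = F.normalize(s_b, dim=-1)
    logits = s_a @ s_b.t() / temperature                # (T, T) similarities
    targets = torch.arange(s_a.size(0), device=s_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def causality_regularizer(feats, predictor):
    """One plausible reading of the causality scheme: features at step t
    must suffice to predict the features at step t+1, discouraging
    acausal (future-leaking) trajectory representations."""
    pred_next = predictor(feats[:-1])                   # predict t+1 from t
    return F.mse_loss(pred_next, feats[1:].detach())

# Usage sketch: T=50 timesteps of proprioception (dim 32) and haptics (dim 16).
enc_p, enc_h = ModalityEncoder(32), ModalityEncoder(16)
predictor = nn.Linear(128, 128)
p, h = torch.randn(50, 32), torch.randn(50, 16)
s_p, _, rec_p = enc_p(p)
s_h, _, rec_h = enc_h(h)
loss = (alignment_loss(s_p, s_h)                        # align shared features
        + F.mse_loss(rec_p, p) + F.mse_loss(rec_h, h)   # keep unique info
        + causality_regularizer(s_p, predictor))        # causal trajectory features
loss.backward()
```

The reconstruction terms are what keep the private embeddings from collapsing: alignment alone would reward discarding everything modality-specific.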
Related papers
- Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models [80.28579390566298]
We introduce Interact2Ar, a text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Hand kinematics are incorporated through dedicated parallel branches, enabling high-fidelity full-body generation. Our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios.
arXiv Detail & Related papers (2025-12-22T18:59:50Z)
- Data-driven simulator of multi-animal behavior with unknown dynamics via offline and online reinforcement learning [8.835312357110618]
A key challenge for realistic multi-animal simulation in biology is bridging the gap between unknown real-world transition models and their simulated counterparts. We introduce a data-driven simulator for multi-animal behavior based on deep reinforcement learning and counterfactual simulation.
arXiv Detail & Related papers (2025-10-12T05:08:26Z)
- MoReact: Generating Reactive Motion from Textual Descriptions [57.642436102978245]
MoReact is a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-09-28T14:31:41Z)
- The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio [138.07247714782412]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids.
arXiv Detail & Related papers (2025-07-03T17:59:58Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
- PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning [38.004463823796286]
We formulate the motor system of an interactive avatar as a generative motion model. Inspired by recent advances in foundation models, we propose PRIMAL. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
arXiv Detail & Related papers (2025-03-21T21:27:57Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS applies to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
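The "lightweight, universal action-conditioned module" in the DWS entry above is not specified in this summary. The following FiLM-style adapter is one common pattern for injecting an action signal into a frozen video backbone's features; the class name, dimensions, and wiring are all hypothetical, not DWS's published architecture.

```python
# A hypothetical action-conditioning adapter: an action embedding produces
# per-channel scale and shift applied to frozen intermediate video features.
import torch
import torch.nn as nn

class ActionFiLMAdapter(nn.Module):
    def __init__(self, action_dim: int, feat_dim: int):
        super().__init__()
        # Small MLP mapping the action to per-channel scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.Linear(action_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, 2 * feat_dim),
        )

    def forward(self, feats: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) intermediate features from the frozen backbone
        # action: (B, action_dim) conditioning signal for the generated clip
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        return feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

adapter = ActionFiLMAdapter(action_dim=8, feat_dim=256)
out = adapter(torch.randn(2, 16, 256), torch.randn(2, 8))  # -> (2, 16, 256)
```

Because only the adapter's parameters are trained, such a module can in principle be attached to either a diffusion or an autoregressive transformer backbone, which matches the versatility the summary claims.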
- HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation [7.01404330241523]
HYPERmotion is a framework that learns, selects and plans behaviors based on tasks in different scenarios.
We combine reinforcement learning with whole-body optimization to generate motion for 38 actuated joints.
Experiments in simulation and real-world show that learned motions can efficiently adapt to new tasks.
arXiv Detail & Related papers (2024-06-20T18:21:24Z)
- Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction [58.67761673662716]
Humans are highly adaptable, swiftly switching between different modes to handle different tasks, situations and contexts.
In human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale child interactive actions that start and end along the timeline.
This work proposes to model two concurrent mechanisms that jointly control human motion.
arXiv Detail & Related papers (2023-07-24T12:21:33Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
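The A-MDM entry above describes producing motion frame by frame, with each new frame sampled by a conditional diffusion model given the previous one. The rollout below is a generic DDPM-style autoregressive sampling loop in that spirit; the noise schedule, shapes, and `denoiser` signature are illustrative assumptions, not the paper's implementation.

```python
# Schematic autoregressive diffusion rollout: each pose is denoised from
# noise, conditioned on the previously generated pose.
import torch

@torch.no_grad()
def rollout(denoiser, init_pose, num_frames=60, num_steps=50):
    # denoiser(x_t, t, prev_pose) -> predicted noise; all tensors (B, pose_dim)
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    poses, prev = [init_pose], init_pose
    for _ in range(num_frames):
        x = torch.randn_like(prev)                      # start each frame from noise
        for t in reversed(range(num_steps)):            # DDPM reverse process
            eps = denoiser(x, torch.full((x.size(0),), t), prev)
            mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            x = mean + (torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else 0.0)
        poses.append(x)
        prev = x                                        # condition the next frame
    return torch.stack(poses, dim=1)                    # (B, num_frames + 1, pose_dim)
```

Controls such as in-painting or task-oriented sampling, as mentioned in the summary, would hook into the inner denoising loop; they are omitted here.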
- An Adaptable Approach to Learn Realistic Legged Locomotion without Examples [38.81854337592694]
This work proposes a generic approach for ensuring realism in locomotion by guiding the learning process with the spring-loaded inverted pendulum model as a reference.
We present experimental results showing that even in a model-free setup, the learned policies can generate realistic and energy-efficient locomotion gaits for a bipedal and a quadrupedal robot.
arXiv Detail & Related papers (2021-10-28T10:14:47Z)
- Learning Reactive and Predictive Differentiable Controllers for Switching Linear Dynamical Models [7.653542219337937]
We present a framework for learning composite dynamical behaviors from expert demonstrations.
We learn a switching linear dynamical model with contacts encoded in switching conditions as a close approximation of our system dynamics.
We then use discrete-time LQR as the differentiable policy class for data-efficient learning of a control strategy.
arXiv Detail & Related papers (2021-03-26T04:40:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.