Precise Action-to-Video Generation Through Visual Action Prompts
- URL: http://arxiv.org/abs/2508.13104v1
- Date: Mon, 18 Aug 2025 17:12:28 GMT
- Title: Precise Action-to-Video Generation Through Visual Action Prompts
- Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
- Abstract summary: Action-driven video generation faces a precision-generality trade-off. Agent-centric action signals provide precision at the cost of cross-domain transferability. We "render" actions into precise visual prompts as domain-agnostic representations.
- Score: 62.951609704196485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
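To make the "render actions into visual prompts" idea concrete, below is a minimal sketch that rasterizes per-frame 2D skeleton joints into prompt frames. The bone topology, image size, and the notion of a downstream adapter are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: rasterize per-frame 2D skeleton joints into RGB prompt
# frames. Bone topology and resolution are made-up placeholders.
import numpy as np
import cv2

BONES = [(0, 1), (1, 2), (2, 3)]  # hypothetical kinematic chain over 4 joints

def render_skeleton_frame(joints_2d, hw=(256, 256)):
    """Rasterize (J, 2) pixel-space joints into one skeleton prompt frame."""
    canvas = np.zeros((hw[0], hw[1], 3), dtype=np.uint8)
    for a, b in BONES:
        pa = tuple(int(v) for v in joints_2d[a])
        pb = tuple(int(v) for v in joints_2d[b])
        cv2.line(canvas, pa, pb, color=(0, 255, 0), thickness=2)
    for x, y in joints_2d:
        cv2.circle(canvas, (int(x), int(y)), radius=3, color=(255, 0, 0), thickness=-1)
    return canvas

# One prompt frame per video frame; the stacked sequence would then be fed to
# the pretrained video model as an extra conditioning stream (e.g., through a
# lightweight adapter, per the abstract's "lightweight fine-tuning").
prompt_video = np.stack(
    [render_skeleton_frame(np.random.rand(4, 2) * 255) for _ in range(16)]
)
print(prompt_video.shape)  # (16, 256, 256, 3)
```

Because the prompt is an image sequence rather than agent-specific state, the same conditioning interface applies to both HOI and robot-manipulation data, which is what enables the cross-domain training described in the abstract.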
Related papers
- Astra: General Interactive World Model with Autoregressive Denoising [73.6594791733982]
Astra is an interactive general world model that generates real-world futures for diverse scenarios. We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interaction.
arXiv Detail & Related papers (2025-12-09T18:59:57Z)
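The "temporal causal attention" mentioned above can be pictured as ordinary attention with a frame-level causal mask: tokens of frame t attend only to frames at or before t. A sketch in PyTorch with made-up dimensions, not Astra's actual layer.

```python
# Sketch of temporal causal attention over a time-major token sequence.
import torch
import torch.nn.functional as F

T, N, D = 8, 4, 32           # frames, tokens per frame, channel dim (assumed)
x = torch.randn(T * N, D)    # flattened token sequence for one clip

frame_id = torch.arange(T).repeat_interleave(N)   # frame index of each token
mask = frame_id[None, :] <= frame_id[:, None]     # (T*N, T*N); True = may attend

q = k = v = x.unsqueeze(0).unsqueeze(0)           # (1, 1, T*N, D), single head
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 1, 32, 32])
```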
- Mask2IV: Interaction-Centric Video Generation via Mask Trajectories [32.04930240447431]
Mask2IV is a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. It supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues.
arXiv Detail & Related papers (2025-10-03T16:04:33Z)
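The decoupled two-stage pipeline reduces to two networks joined by a mask-trajectory tensor. The modules below are shape-only stand-ins that show the interface and data flow; all shapes and names are assumptions, not the paper's models.

```python
# Stand-in sketch of a decoupled two-stage pipeline: predict mask
# trajectories first, then generate video conditioned on them.
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):  # stage 1 (stub)
    def forward(self, first_frame, action_emb):
        b = first_frame.shape[0]
        # (B, T, 2, H, W): one mask track for the actor, one for the object
        return torch.zeros(b, 16, 2, 64, 64)

class TrajectoryConditionedGenerator(nn.Module):  # stage 2 (stub)
    def forward(self, first_frame, mask_traj):
        b, t = mask_traj.shape[:2]
        return torch.zeros(b, t, 3, 64, 64)  # (B, T, C, H, W) video

frame = torch.zeros(1, 3, 64, 64)   # initial frame
action = torch.zeros(1, 512)        # embedded action description or spatial cue
masks = TrajectoryPredictor()(frame, action)
video = TrajectoryConditionedGenerator()(frame, masks)
print(video.shape)  # torch.Size([1, 16, 3, 64, 64])
```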
- iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer [43.58952721477297]
This paper presents iDiT-HOI, a novel framework that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token processing method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model.
arXiv Detail & Related papers (2025-06-15T13:41:43Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning [13.411096520754507]
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors. We propose a dynamic action semantic-aware graph transformer to comprehensively capture the essence of object behavior.
arXiv Detail & Related papers (2025-02-19T14:16:47Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation? [17.356760351203715]
This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects. We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap. We significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios.
arXiv Detail & Related papers (2024-12-13T11:22:01Z)
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but open-loop approaches often overlook the interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
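One standard way to estimate a revolute axis online from segmented 3D point clouds is to align the moving part's points across two frames (Kabsch/SVD) and read the axis off the fitted rotation. Whether the paper uses this exact estimator is not stated in the summary; treat this as a generic sketch.

```python
# Generic sketch: fit a rigid rotation between two segmentations of the same
# moving part, then recover the rotation's fixed axis.
import numpy as np

def estimate_rotation(p, q):
    """Best-fit rotation R with q ~ R @ p after centering (Kabsch via SVD)."""
    pc, qc = p - p.mean(0), q - q.mean(0)
    u, _, vt = np.linalg.svd(qc.T @ pc)
    d = np.sign(np.linalg.det(u @ vt))   # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

def rotation_axis(r):
    """Unit rotation axis: the eigenvector of R with eigenvalue 1."""
    w, v = np.linalg.eig(r)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return axis / np.linalg.norm(axis)

# Toy check: points rotated about z should recover the z axis (up to sign).
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.3
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(rotation_axis(estimate_rotation(pts, pts @ rz.T)))  # ~ [0, 0, +/-1]
```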
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
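EAN's second step, mining interactions among only a few selected foreground objects, amounts to scoring tokens, keeping the top-k, and attending among just those. A generic sketch with made-up shapes and a stand-in scorer; EAN's actual selection and attention design may differ.

```python
# Sketch: restrict Transformer interactions to k "foreground" tokens.
import torch
import torch.nn as nn

B, N, D, K = 2, 196, 64, 8    # batch, tokens, channels, kept tokens (assumed)
tokens = torch.randn(B, N, D)

scorer = nn.Linear(D, 1)      # stand-in foreground saliency scorer
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

scores = scorer(tokens).squeeze(-1)         # (B, N) saliency per token
top = scores.topk(K, dim=1).indices         # indices of K foreground tokens
idx = top.unsqueeze(-1).expand(-1, -1, D)   # (B, K, D) gather index
fg = tokens.gather(1, idx)                  # selected foreground tokens

out, _ = attn(fg, fg, fg)                   # interactions among K tokens only
print(out.shape)  # torch.Size([2, 8, 64])
```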