Precise Action-to-Video Generation Through Visual Action Prompts
- URL: http://arxiv.org/abs/2508.13104v1
- Date: Mon, 18 Aug 2025 17:12:28 GMT
- Title: Precise Action-to-Video Generation Through Visual Action Prompts
- Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
- Abstract summary: Action-driven video generation faces a precision-generality trade-off. Agent-centric action signals provide precision at the cost of cross-domain transferability. We "render" actions into precise visual prompts as domain-agnostic representations.
- Score: 62.951609704196485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
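To make the "render actions into visual prompts" idea concrete, below is a minimal sketch that rasterizes per-frame 2D skeleton joints into prompt frames. The bone topology, image size, and the notion of a downstream adapter are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: rasterize per-frame 2D skeleton joints into RGB prompt
# frames. Bone topology and resolution are made-up placeholders.
import numpy as np
import cv2

BONES = [(0, 1), (1, 2), (2, 3)]  # hypothetical kinematic chain over 4 joints

def render_skeleton_frame(joints_2d, hw=(256, 256)):
    """Rasterize (J, 2) pixel-space joints into one skeleton prompt frame."""
    canvas = np.zeros((hw[0], hw[1], 3), dtype=np.uint8)
    for a, b in BONES:
        pa = tuple(int(v) for v in joints_2d[a])
        pb = tuple(int(v) for v in joints_2d[b])
        cv2.line(canvas, pa, pb, color=(0, 255, 0), thickness=2)
    for x, y in joints_2d:
        cv2.circle(canvas, (int(x), int(y)), radius=3, color=(255, 0, 0), thickness=-1)
    return canvas

# One prompt frame per video frame; the stacked sequence would then be fed to
# the pretrained video model as an extra conditioning stream (e.g., through a
# lightweight adapter, per the abstract's "lightweight fine-tuning").
prompt_video = np.stack(
    [render_skeleton_frame(np.random.rand(4, 2) * 255) for _ in range(16)]
)
print(prompt_video.shape)  # (16, 256, 256, 3)
```

Because the prompt is an image sequence rather than agent-specific state, the same conditioning interface applies to both HOI and robot-manipulation data, which is what enables the cross-domain training described in the abstract.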
Related papers
- Astra: General Interactive World Model with Autoregressive Denoising [73.6594791733982]
Astra is an interactive general world model that generates real-world futures for diverse scenarios. We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interaction.
arXiv Detail & Related papers (2025-12-09T18:59:57Z)
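The "temporal causal attention" mentioned above can be pictured as ordinary attention with a frame-level causal mask: tokens of frame t attend only to frames at or before t. A sketch in PyTorch with made-up dimensions, not Astra's actual layer.

```python
# Sketch of temporal causal attention over a time-major token sequence.
import torch
import torch.nn.functional as F

T, N, D = 8, 4, 32           # frames, tokens per frame, channel dim (assumed)
x = torch.randn(T * N, D)    # flattened token sequence for one clip

frame_id = torch.arange(T).repeat_interleave(N)   # frame index of each token
mask = frame_id[None, :] <= frame_id[:, None]     # (T*N, T*N); True = may attend

q = k = v = x.unsqueeze(0).unsqueeze(0)           # (1, 1, T*N, D), single head
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 1, 32, 32])
```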
- Mask2IV: Interaction-Centric Video Generation via Mask Trajectories [32.04930240447431]
Mask2IV is a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. It supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues.
arXiv Detail & Related papers (2025-10-03T16:04:33Z)
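The decoupled two-stage pipeline reduces to two networks joined by a mask-trajectory tensor. The modules below are shape-only stand-ins that show the interface and data flow; all shapes and names are assumptions, not the paper's models.

```python
# Stand-in sketch of a decoupled two-stage pipeline: predict mask
# trajectories first, then generate video conditioned on them.
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):  # stage 1 (stub)
    def forward(self, first_frame, action_emb):
        b = first_frame.shape[0]
        # (B, T, 2, H, W): one mask track for the actor, one for the object
        return torch.zeros(b, 16, 2, 64, 64)

class TrajectoryConditionedGenerator(nn.Module):  # stage 2 (stub)
    def forward(self, first_frame, mask_traj):
        b, t = mask_traj.shape[:2]
        return torch.zeros(b, t, 3, 64, 64)  # (B, T, C, H, W) video

frame = torch.zeros(1, 3, 64, 64)   # initial frame
action = torch.zeros(1, 512)        # embedded action description or spatial cue
masks = TrajectoryPredictor()(frame, action)
video = TrajectoryConditionedGenerator()(frame, masks)
print(video.shape)  # torch.Size([1, 16, 3, 64, 64])
```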
- iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer [43.58952721477297]
This paper presents iDiT-HOI, a novel framework that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token processing method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model.
arXiv Detail & Related papers (2025-06-15T13:41:43Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning [13.411096520754507]
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors. We propose a dynamic action semantic-aware graph transformer to comprehensively capture the essence of object behavior.
arXiv Detail & Related papers (2025-02-19T14:16:47Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation? [17.356760351203715]
This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects. We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap. We significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios.
arXiv Detail & Related papers (2024-12-13T11:22:01Z)
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but open-loop approaches often overlook the interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
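One standard way to estimate a revolute axis online from segmented 3D point clouds is to align the moving part's points across two frames (Kabsch/SVD) and read the axis off the fitted rotation. Whether the paper uses this exact estimator is not stated in the summary; treat this as a generic sketch.

```python
# Generic sketch: fit a rigid rotation between two segmentations of the same
# moving part, then recover the rotation's fixed axis.
import numpy as np

def estimate_rotation(p, q):
    """Best-fit rotation R with q ~ R @ p after centering (Kabsch via SVD)."""
    pc, qc = p - p.mean(0), q - q.mean(0)
    u, _, vt = np.linalg.svd(qc.T @ pc)
    d = np.sign(np.linalg.det(u @ vt))   # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

def rotation_axis(r):
    """Unit rotation axis: the eigenvector of R with eigenvalue 1."""
    w, v = np.linalg.eig(r)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return axis / np.linalg.norm(axis)

# Toy check: points rotated about z should recover the z axis (up to sign).
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.3
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(rotation_axis(estimate_rotation(pts, pts @ rz.T)))  # ~ [0, 0, +/-1]
```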
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
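EAN's second step, mining interactions among only a few selected foreground objects, amounts to scoring tokens, keeping the top-k, and attending among just those. A generic sketch with made-up shapes and a stand-in scorer; EAN's actual selection and attention design may differ.

```python
# Sketch: restrict Transformer interactions to k "foreground" tokens.
import torch
import torch.nn as nn

B, N, D, K = 2, 196, 64, 8    # batch, tokens, channels, kept tokens (assumed)
tokens = torch.randn(B, N, D)

scorer = nn.Linear(D, 1)      # stand-in foreground saliency scorer
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

scores = scorer(tokens).squeeze(-1)         # (B, N) saliency per token
top = scores.topk(K, dim=1).indices         # indices of K foreground tokens
idx = top.unsqueeze(-1).expand(-1, -1, D)   # (B, K, D) gather index
fg = tokens.gather(1, idx)                  # selected foreground tokens

out, _ = attn(fg, fg, fg)                   # interactions among K tokens only
print(out.shape)  # torch.Size([2, 8, 64])
```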