Fugu-MT Paper Translation (Summary): Precise Action-to-Video Generation Through Visual Action Prompts

Paper summary: Precise Action-to-Video Generation Through Visual Action Prompts

arxiv url: http://arxiv.org/abs/2508.13104v1
Date: Mon, 18 Aug 2025 17:12:28 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:11.504837
Title: Precise Action-to-Video Generation Through Visual Action Prompts
Title (reference translation): High-Precision Action-to-Video Generation via Visual Action Prompts
Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
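Abstract summary: Action-driven video generation faces a trade-off between precision and generality. Agent-centric action signals provide precision at the cost of cross-domain transferability. The authors "render" actions into precise visual prompts that serve as domain-agnostic representations.
Reference score (attention level computed independently by this site): 62.951609704196485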
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
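Abstract (reference translation): We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions that maintains transferable visual dynamics across domains. Existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions. We propose robust pipelines to construct skeletons from two interaction-rich data sources: human-object interactions (HOI) and dexterous robotic manipulation. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of the proposed approach. Project page: https://zju3dv.github.io/VAP/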
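As a rough illustration of what "rendering" an action into a visual skeleton prompt could look like, the sketch below rasterizes 2D keypoints and the bones connecting them into a binary image for a single frame. This is not the authors' pipeline: the names `draw_line` and `render_skeleton`, the keypoint layout, and the use of Bresenham line drawing are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; hypothetical layout


def draw_line(canvas: List[List[int]], p0: Point, p1: Point) -> None:
    """Rasterize one segment with Bresenham's algorithm, clipped to the canvas."""
    height, width = len(canvas), len(canvas[0])
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        # Only write pixels that fall inside the image bounds.
        if 0 <= x0 < width and 0 <= y0 < height:
            canvas[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_skeleton(
    keypoints: Sequence[Point],
    bones: Sequence[Tuple[int, int]],
    width: int,
    height: int,
) -> List[List[int]]:
    """Return a height x width binary image with every bone drawn as a line.

    `bones` holds index pairs into `keypoints` (an assumed convention).
    """
    canvas = [[0] * width for _ in range(height)]
    for start, end in bones:
        draw_line(canvas, keypoints[start], keypoints[end])
    return canvas
```

Calling `render_skeleton([(1, 1), (5, 1), (5, 4)], [(0, 1), (1, 2)], width=8, height=6)` draws a two-bone chain. Under this sketch, a prompt for a whole clip would simply be one such image per frame, built from that frame's keypoints.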

Related papers