Fugu-MT Paper Translation (Abstract): ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Paper overview: ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

arxiv url: http://arxiv.org/abs/2604.01043v1
Date: Wed, 01 Apr 2026 15:52:00 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:32.070092
Title: ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration
Title (reference translation): ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatially Decoupled Motion Injection and Hybrid Context Integration
Authors: Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan, Jianglin Fu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Angela Yao
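Abstract summary: We propose ONE-SHOT, a parameter-efficient framework for human-environment video generation. Our key insight is to factorize the generative process into disentangled signals; in particular, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues. The proposed method outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis.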
Reference score (independently computed attention score): 49.72976665549397
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at: https://martayang.github.io/ONE-SHOT/.
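Abstract (reference translation): Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, but fine-grained, independent editing of subjects and scenes remains a key challenge. Recent attempts to add environment control through rigid 3D geometric compositions often run into a trade-off between precise control and generative flexibility. In addition, heavy 3D pre-processing limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a new positional embedding strategy that establishes spatial correspondences between different spatial domains without heuristic 3D alignment. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency. Experiments show that our method outperforms state-of-the-art methods and offers superior structural control and creative diversity for video synthesis. Our project is available at https://martayang.github.io/ONE-SHOT/.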

Related papers