Fugu-MT paper translation (abstract): SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Paper summary: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

arxiv url: http://arxiv.org/abs/2506.23690v1
Date: Mon, 30 Jun 2025 10:09:32 GMT
Status: translation complete
Last updated in system: 2025-07-01 21:27:54.012185
Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
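Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Abstract summary: SynMotion is a video generation model that jointly uses semantic guidance and visual adaptation. At the semantic level, it introduces a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations. At the visual level, it integrates efficient motion adapters into a pre-trained video generation model to improve motion fidelity and temporal coherence.
Reference score (independently computed attention score): 56.90807453045657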
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy that alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: https://lucaria-academy.github.io/SynMotion/
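The alternating optimization of subject and motion embeddings mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no losses, model details, or schedule, so everything here is an assumption made for illustration. The function names (`mse_grad`, `alternate_optimize`), the additive combination of the two embeddings, the quadratic losses, and the strict even/odd step schedule are all hypothetical. The only idea taken from the abstract is that one embedding is updated while the other is held fixed, with the subject phase drawing on separate subject-prior data.

```python
def mse_grad(pred, target):
    """Gradient of the mean squared error with respect to pred."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def alternate_optimize(subject, motion, motion_target, subject_target,
                       steps=200, lr=0.1):
    """Toy alternating optimization of two embeddings (hypothetical setup).

    Even steps: the subject embedding is frozen and the motion embedding is
    updated so that subject + motion matches motion_target (a stand-in for
    the customized motion videos).
    Odd steps: the motion embedding is frozen and the subject embedding is
    updated toward subject_target (a stand-in for subject-prior data).
    """
    subject = list(subject)
    motion = list(motion)
    for step in range(steps):
        if step % 2 == 0:
            # Motion phase: only the motion embedding receives a gradient.
            pred = [s + m for s, m in zip(subject, motion)]
            grad = mse_grad(pred, motion_target)
            motion = [m - lr * g for m, g in zip(motion, grad)]
        else:
            # Subject phase: only the subject embedding receives a gradient.
            grad = mse_grad(subject, subject_target)
            subject = [s - lr * g for s, g in zip(subject, grad)]
    return subject, motion


if __name__ == "__main__":
    s, m = alternate_optimize(
        subject=[0.0] * 4,
        motion=[0.0] * 4,
        motion_target=[1.0, 2.0, 3.0, 4.0],
        subject_target=[0.5] * 4,
        steps=400,
        lr=0.5,
    )
    print("subject:", s)
    print("motion:", m)
```

In this toy setting the subject embedding settles on its own target while the motion embedding absorbs the remainder of the motion target, which loosely mirrors the stated goal of keeping subject and motion representations separate. A real system would replace the quadratic losses with the diffusion training objective of the video model.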


Related papers