Fugu-MT Paper Translation (Abstract): Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

Paper summary: Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

arxiv url: http://arxiv.org/abs/2606.06903v1
Date: Fri, 05 Jun 2026 04:39:46 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.56907
Title: Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
Authors: Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao
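Abstract summary: We present DirectAnimator, a framework that bypasses pose extraction and learns directly from raw driving videos. We introduce a Driving Cue Triplet of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form. We devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence.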
Reference score (independently computed attention metric): 67.1159444608631
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.

Related papers