Fugu-MT Paper Translation (Abstract): SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Paper summary: SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

arxiv url: http://arxiv.org/abs/2603.11746v1
Date: Thu, 12 Mar 2026 09:49:58 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.004455
Title: SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
Title (reference translation): SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
Authors: Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin
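Abstract summary: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis. The paper proposes Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. The proposed approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared with existing AR diffusion methods.
Reference score (independently computed attention score): 25.57144961436966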
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
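Abstract (reference translation): Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. This paper identifies two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow without bound and lack structure, which prevents effective reuse of cached states and severely limits inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building on this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments show that the approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared with existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results show that the method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.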
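The abstract states only that ConvKV memory compresses the keys and values in causal attention into a fixed-length representation. As a rough illustration of why that yields constant-memory inference, the sketch below keeps a bounded key/value buffer and, on overflow, merges the oldest entries with a strided 1D convolution over time. This is not the paper's implementation: the uniform averaging kernel stands in for whatever convolution the authors actually use, and the names `conv_compress`, `FixedLengthKVMemory`, `simulate`, `max_len`, and `kernel` are all hypothetical.

```python
from typing import List

Vector = List[float]


def conv_compress(seq: List[Vector], kernel: int) -> List[Vector]:
    """Strided 1D convolution over time with a uniform kernel (stride == kernel).

    Assumption: a plain average replaces the (unknown) convolution of the paper.
    Trailing entries that do not fill a whole window are dropped, so callers
    should pass a length that is a multiple of `kernel`.
    """
    out: List[Vector] = []
    for start in range(0, len(seq) - kernel + 1, kernel):
        window = seq[start:start + kernel]
        out.append([sum(column) / kernel for column in zip(*window)])
    return out


class FixedLengthKVMemory:
    """Hypothetical bounded KV cache: never holds more than `max_len` entries."""

    def __init__(self, max_len: int = 8, kernel: int = 2) -> None:
        if kernel < 2 or max_len < 2 * kernel:
            raise ValueError("need kernel >= 2 and max_len >= 2 * kernel")
        self.max_len = max_len
        self.kernel = kernel
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def __len__(self) -> int:
        return len(self.keys)

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(list(key))
        self.values.append(list(value))
        if len(self.keys) > self.max_len:
            # Compress roughly the oldest half; recent entries stay untouched.
            n_old = (self.max_len // 2 // self.kernel) * self.kernel
            self.keys = conv_compress(self.keys[:n_old], self.kernel) + self.keys[n_old:]
            self.values = conv_compress(self.values[:n_old], self.kernel) + self.values[n_old:]


def simulate(num_frames: int, max_len: int = 8, kernel: int = 2) -> FixedLengthKVMemory:
    """Stream `num_frames` toy key/value pairs through the bounded memory."""
    memory = FixedLengthKVMemory(max_len=max_len, kernel=kernel)
    for t in range(num_frames):
        memory.append(key=[float(t)], value=[float(t), 1.0])
    return memory
```

For scale, one hour at the 20 FPS quoted in the abstract is 20 × 3600 = 72,000 frames. An unbounded cache would hold one entry per frame, while `simulate(20 * 3600)` ends with at most `max_len` entries, and older history survives only in averaged form.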

Related papers