Fugu-MT Paper Translation (Abstract): AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Paper overview: AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

arxiv url: http://arxiv.org/abs/2603.14331v1
Date: Sun, 15 Mar 2026 11:42:07 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.751938
Title: AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising
Title (reference translation): AvatarForcing: One-Step Streaming Avatars via Local-Future Sliding-Window Denoising
Authors: Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu,
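Abstract summary: AvatarForcing is a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels. Experiments on standard benchmarks and a 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame.
Reference score (independently computed attention measure): 15.787466786514164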
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for realtime streaming. Our page is available at: https://cuiliyuan121.github.io/AvatarForcing/
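Abstract (reference translation): Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, so errors accumulate over long rollouts and become irreversible. In contrast, full-sequence diffusion transformers mitigate drift but are computationally prohibitive for real-time long-form synthesis. AvatarForcing is a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step at constant per-step cost. To stabilize unbounded streams, the method introduces a style anchor, which re-indexes RoPE to maintain a fixed relative position with respect to the active window, and a temporal anchor, which reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. https://cuiliyuan121.github.io/AvatarForcing/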
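The sliding-window schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `stream_blocks`, the integer noise levels, and the default window size of 4 are all assumptions made for illustration, and no actual denoising model is involved. The sketch only tracks which block sits at which noise level, to show how a fixed window with heterogeneous noise levels can emit one clean block per step.

```python
from collections import deque


def stream_blocks(num_blocks, window=4):
    """Toy scheduler (hypothetical): yield (step, block_id) as blocks become clean.

    Noise levels are integers: 0 is clean, `window` is pure noise.
    Only the levels are tracked; no real denoising happens here.
    """
    # Warm-up state: block i starts at noise level i + 1, so levels are
    # heterogeneous and increase toward the future end of the window.
    active = deque((i, i + 1) for i in range(min(window, num_blocks)))
    next_id = len(active)
    step = 0
    while active:
        step += 1
        # One pass over the window lowers every block by one noise level.
        active = deque((bid, level - 1) for bid, level in active)
        # The oldest block has now reached level 0, so it is emitted.
        bid, level = active.popleft()
        assert level == 0
        yield step, bid
        # Slide the window: append a fresh pure-noise block if any remain.
        if next_id < num_blocks:
            active.append((next_id, window))
            next_id += 1


if __name__ == "__main__":
    for step, block_id in stream_blocks(6, window=4):
        print(f"step {step}: emit clean block {block_id}")
```

In this toy, each step touches at most `window` blocks regardless of how long the stream runs, which mirrors the constant per-step cost claimed in the abstract. For scale, the reported 34 ms/frame corresponds to roughly 1000 / 34 ≈ 29 frames per second.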

Related papers