Fugu-MT Paper Translation (Abstract): StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Paper overview: StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

arxiv url: http://arxiv.org/abs/2508.08248v1
Date: Mon, 11 Aug 2025 17:58:24 GMT
Status: Translation complete
System update date: 2025-08-12 21:23:29.256328
Title: StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
Title (reference translation): StableAvatar: Infinite-Length Audio-Driven Avatar Video
Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
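Abstract summary: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing.
Reference score (independently computed attention score): 91.45910771331741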
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
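Abstract (reference translation): Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason existing models cannot generate long videos lies in their audio modeling. They typically use third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Because current diffusion backbones lack audio-related priors, this approach causes severe latent distribution error to accumulate across video clips, so that the latent distribution of subsequent segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a Time-step-aware Audio Adapter that prevents error accumulation through time-step-aware modulation. It also proposes a novel Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of infinite-length videos, a Dynamic Weighted Sliding-window Strategy that fuses latents over time is introduced. Benchmark experiments show the effectiveness of StableAvatar both qualitatively and quantitatively.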
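
The abstract names a Dynamic Weighted Sliding-window Strategy that fuses latents over time but does not give its weighting scheme. The sketch below only illustrates the general idea of blending overlapping windows. The linear ramp, the function names `linear_ramp` and `fuse_windows`, and the use of one float per frame in place of a latent tensor are all assumptions made for illustration, not the paper's method.

```python
from typing import List


def linear_ramp(overlap: int) -> List[float]:
    """Weights for the incoming window over the overlap, rising toward 1.

    Assumption: a linear ramp. The abstract does not state the actual weights.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


def fuse_windows(windows: List[List[float]], overlap: int) -> List[float]:
    """Join consecutive windows, cross-fading `overlap` frames between neighbours.

    Each frame is a single float here as a stand-in for a latent tensor.
    """
    if not windows:
        return []
    fused = list(windows[0])
    for window in windows[1:]:
        if overlap <= 0:
            # No shared frames: plain concatenation.
            fused.extend(window)
            continue
        if overlap > min(len(fused), len(window)):
            raise ValueError("overlap is longer than a window")
        tail_start = len(fused) - overlap
        for i, w in enumerate(linear_ramp(overlap)):
            # Fade out the previous window while fading in the new one.
            fused[tail_start + i] = (1.0 - w) * fused[tail_start + i] + w * window[i]
        # Frames past the overlap come only from the new window.
        fused.extend(window[overlap:])
    return fused


if __name__ == "__main__":
    # Two 4-frame windows sharing 2 frames yield 6 fused frames.
    print(fuse_windows([[0.0] * 4, [1.0] * 4], overlap=2))
```

With a ramp like this, frames near the start of the overlap stay close to the earlier window and frames near its end stay close to the later one, which avoids a hard cut at the window boundary.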

Related papers