Fugu-MT paper translation (abstract): StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Paper overview: StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

arxiv url: http://arxiv.org/abs/2605.25659v1
Date: Mon, 25 May 2026 10:04:52 GMT
Status: translation complete
Last updated in system: 2026-05-26 19:50:19.641861
Title: StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
Authors: Linrui Tian, Qi Wang, Bang Zhang
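Abstract summary: StreamChar is a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU.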
Reference score (independently computed attention score): 16.23723735702324
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
