Fugu-MT Paper Translation (Abstract): InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

Paper overview: InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

arxiv url: http://arxiv.org/abs/2606.22905v1
Date: Mon, 22 Jun 2026 06:41:17 GMT
Status: Translation complete
System update date: 2026-06-25 03:47:41.357686
Title: InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars
Title (reference translation): InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars
Authors: Quanyue Song, Yishan He, Yanfei Zhang, Shihao Cheng, Zhixiang He, Zhizhi Guo, Chi Zhang, Xuelong Li, Caigui Jiang
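Abstract summary: This work proposes a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interaction. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. The proposed method enables complex user-avatar interaction in real time while achieving state-of-the-art visual consistency in long-duration generation.
Reference score (independently computed attention score): 39.5461462800725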
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.
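Abstract (reference translation): Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and cannot explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interaction. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. To ensure visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens. To generate avatars whose speech and actions align with user intent, we propose a Reasoning-Reaction Module (RRM) that incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results across diverse scenarios show that the method achieves state-of-the-art visual consistency in long-duration generation and enables complex user-avatar interaction in real time.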

Related papers