Fugu-MT paper translation (abstract): Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Paper overview: Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

arxiv url: http://arxiv.org/abs/2606.25041v1
Date: Tue, 23 Jun 2026 18:01:03 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:05:30.109616
Title: Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Title (reference translation): Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Authors: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi
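Abstract summary: Wan-Streamer is an end-to-end interactive foundation model for low-latency streaming interaction. It seamlessly models audio and video as both input and output within a single Transformer sequence. It achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency.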
Reference score (independently computed attention score): 66.03724575571962
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
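Abstract (reference translation): We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer; the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules. To support natural audio-visual responsiveness, the entire stack is redesigned around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.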
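
The figures in the abstract can be checked with a little arithmetic, and the block-causal attention pattern it mentions can be pictured as a boolean mask. The following is a minimal Python sketch, not the authors' implementation. It assumes the common reading of "block-causal" (a token may attend to every token in its own block and in all earlier blocks), and all function names, along with the toy block count and block size, are hypothetical choices made for illustration.

```python
from typing import List


def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def total_latency_ms(model_ms: int, network_ms: int) -> int:
    """Total interaction latency as the sum of model-side and network latency."""
    return model_ms + network_ms


def block_causal_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """Build a block-causal mask (assumed definition, see lead-in).

    mask[i][j] is True when token i may attend to token j, i.e. when
    token j sits in the same block as token i or in an earlier block.
    """
    n = num_blocks * block_size
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # 160 ms streaming unit at 25 fps, as stated in the abstract.
    print(frames_per_unit(160, 25))    # 4
    # 200 ms model-side latency plus 350 ms bidirectional network latency.
    print(total_latency_ms(200, 350))  # 550
    # Toy mask: 3 blocks of 2 tokens each (sizes are illustrative only).
    for row in block_causal_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this reading, each newly arrived block can be processed as soon as it is complete, since no token depends on a later block; that is the property incremental streaming needs.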

Related papers