Fugu-MT Paper Translation (Abstract): X-Streamer: Unified Human World Modeling with Audiovisual Interaction

Paper overview: X-Streamer: Unified Human World Modeling with Audiovisual Interaction

arxiv url: http://arxiv.org/abs/2509.21574v1
Date: Thu, 25 Sep 2025 20:53:27 GMT
Status: Translation complete
System update date: 2025-09-29 20:57:54.007688
Title: X-Streamer: Unified Human World Modeling with Audiovisual Interaction
Title (reference translation): X-Streamer: Unified Human World Modeling with Audiovisual Interaction
Authors: You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, Linjie Luo
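Abstract summary: X-Streamer is a framework for building digital human agents capable of infinite interactions across text, speech, and video. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation. X-Streamer runs in real time on two A100 GPUs and sustains consistent video chat experiences for hours.
Reference score (independently computed attention metric): 36.50697656708077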
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
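Abstract (reference translation): X-Streamer is an end-to-end multimodal human world modeling framework for building, within a single unified architecture, digital human agents capable of infinite interactions across text, speech, and video. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation and turns a static portrait into persistent, intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, and the Actor translates the Thinker's hidden states into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor uses a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses made of interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, the authors design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustains hours-long consistent video chat experiences from arbitrary portraits, and paves the way toward unified world modeling of interactive digital humans.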
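
The abstract mentions inter- and intra-chunk attention in a chunk-wise autoregressive model but does not spell out the masking pattern. The sketch below shows one common arrangement, assumed here for illustration only: positions attend freely within their own chunk and causally to all earlier chunks. The function name `chunk_causal_mask` and the parameters `num_chunks` and `chunk_size` are hypothetical and are not taken from the paper.

```python
from typing import List


def chunk_causal_mask(num_chunks: int, chunk_size: int) -> List[List[bool]]:
    """Build a block-causal attention mask (an assumed pattern, not the paper's).

    mask[i][j] is True when query position i may attend to key position j.
    """
    total = num_chunks * chunk_size
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            # Same chunk: bidirectional (intra-chunk) attention.
            # Earlier chunk: causal (inter-chunk) attention to past context.
            # Later chunk: blocked, so generation stays autoregressive.
            row.append(j // chunk_size <= i // chunk_size)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 3 chunks of 2 positions each ("x" = may attend).
    for row in chunk_causal_mask(num_chunks=3, chunk_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With three chunks of two positions, the first two rows print `xx....`, the next two `xxxx..`, and the last two `xxxxxx`: each chunk sees itself fully plus everything generated before it.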

Related papers