Fugu-MT Paper Translation (Abstract): UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

Paper overview: UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

arxiv url: http://arxiv.org/abs/2605.14731v1
Date: Thu, 14 May 2026 11:56:03 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.802818
Title: UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
Authors: Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu, Pengcheng Fang, Tengjiao Sun, Xiaohao Cai, Hansung Kim, Yuanqi Li, Jie Guo, Yanwen Guo
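Abstract summary: UMo is a unified sparse motion modeling architecture for real-time co-speech avatars. It performs real-time dense reconstruction efficiently, enabling temporally coherent, high-fidelity animation generation. It preserves fine-grained speech-motion alignment even under strict latency constraints.
Reference score (independently computed attention measure): 25.55654497627044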
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.
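The abstract describes a temporally sparse, keyframe-centric design followed by dense reconstruction, but it does not say how the dense frames are recovered. As a rough illustration of the general idea only, the sketch below fills in per-frame poses from sparse keyframes by plain linear interpolation. The function name `densify`, the `(frame index, pose vector)` keyframe layout, and the choice of linear interpolation are all assumptions made for this example and are not taken from the paper; UMo's actual reconstruction is presumably learned.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical layout: (frame index, pose vector). Not from the paper.
Keyframe = Tuple[int, Sequence[float]]


def densify(keyframes: List[Keyframe], num_frames: int) -> List[List[float]]:
    """Linearly interpolate sparse keyframes into a dense per-frame pose track.

    Linear interpolation is a stand-in chosen for illustration; it is not
    claimed to be the reconstruction used by UMo.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    keyframes = sorted(keyframes, key=lambda kf: kf[0])
    indices = [idx for idx, _ in keyframes]
    dense: List[List[float]] = []
    for t in range(num_frames):
        # Position of the first keyframe whose index is strictly greater than t.
        hi = bisect_right(indices, t)
        if hi == 0:
            # Before the first keyframe: hold the first pose.
            dense.append(list(keyframes[0][1]))
        elif hi == len(keyframes):
            # At or after the last keyframe: hold the last pose.
            dense.append(list(keyframes[-1][1]))
        else:
            (t0, p0), (t1, p1) = keyframes[hi - 1], keyframes[hi]
            w = (t - t0) / (t1 - t0)  # blend weight in [0, 1)
            dense.append([(1 - w) * a + w * b for a, b in zip(p0, p1)])
    return dense


if __name__ == "__main__":
    track = densify([(0, [0.0, 0.0]), (4, [1.0, 2.0])], num_frames=6)
    for frame, pose in enumerate(track):
        print(frame, pose)
```

In this toy setting only two keyframes are stored for six output frames, which is the sense in which the representation is temporally sparse: the generator would only need to predict the keyframes, and the remaining frames are filled in cheaply.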

Related papers