Fugu-MT Paper Translation (Abstract): LiveGesture Streamable Co-Speech Gesture Generation Model

Paper overview: LiveGesture Streamable Co-Speech Gesture Generation Model

arxiv url: http://arxiv.org/abs/2604.10927v1
Date: Mon, 13 Apr 2026 02:54:07 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.283399
Title: LiveGesture Streamable Co-Speech Gesture Generation Model
Authors: Muhammad Usama Saleem, Mayur Jagdishbhai Patel, Ekkasit Pinyoanuntapong, Zhongxing Qin, Li Yang, Hongfei Xue, Ahmed Helmy, Chen Chen, Pu Wang
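Abstract summary: LiveGesture is a speech-driven full-body gesture generation framework. It operates with zero look-ahead and supports arbitrary sequence lengths. It generates coherent, diverse, and beat-synchronous full-body gestures in real time.
Reference score (independently computed attention level): 15.008891901028333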
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-expert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero look-ahead conditions.
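The zero look-ahead property of the tokenizer can be illustrated with a toy sketch. This is not the paper's SVQ: the learned causal encoder is replaced here by a fixed weighted average over the current and past frames, and the names `StreamingCausalQuantizer`, `kernel`, and `codebook` are hypothetical, chosen only for illustration. What the sketch shows is the streaming contract described in the abstract: one token is emitted per incoming frame, and a token never changes once later frames arrive.

```python
from collections import deque
from typing import Deque, List, Sequence

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook vector closest to x (squared L2)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


class StreamingCausalQuantizer:
    """Toy causal tokenizer (assumption: stands in for a learned encoder).

    Each token depends only on the current frame and a fixed number of
    past frames, so it can be emitted as soon as the frame arrives.
    """

    def __init__(self, codebook: Sequence[Vector], kernel: Sequence[float]):
        self.codebook = codebook
        self.kernel = list(kernel)  # kernel[-1] weights the current frame
        self.history: Deque[List[float]] = deque(maxlen=len(self.kernel))

    def push(self, frame: Vector) -> int:
        """Consume one frame and immediately return one discrete token."""
        self.history.append(list(frame))
        # Left-pad with zero frames so the first frames see an empty past.
        pad = len(self.kernel) - len(self.history)
        frames = [[0.0] * len(frame)] * pad + list(self.history)
        # Causal feature: weighted sum over past and current frames only.
        feature = [
            sum(w * f[d] for w, f in zip(self.kernel, frames))
            for d in range(len(frame))
        ]
        return nearest_code(feature, self.codebook)


def tokenize_stream(frames, codebook, kernel) -> List[int]:
    """Tokenize a sequence frame by frame, as a live stream would."""
    quantizer = StreamingCausalQuantizer(codebook, kernel)
    return [quantizer.push(f) for f in frames]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
    kernel = [0.25, 0.75]
    frames = [[1.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-1.0, 1.0]]
    tokens = tokenize_stream(frames, codebook, kernel)
    # Tokens for a prefix match the prefix of the full token sequence.
    assert tokenize_stream(frames[:2], codebook, kernel) == tokens[:2]
    print(tokens)
```

Because the history buffer has a fixed length, memory use stays constant however long the stream runs, which is one simple way to support arbitrary sequence length.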
