Fugu-MT paper translation (abstract): Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

Paper abstract: Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

arxiv url: http://arxiv.org/abs/2512.23650v2
Date: Sun, 04 Jan 2026 07:49:24 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.559303
Title: Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
Title (reference translation): Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
Authors: Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang, Zhenguo Sun, Yibo Peng, Pengwei Wang, Zhongyuan Wang, Fangzhou Liu, Chang Xu, Shanghang Zhang
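Abstract summary: Current humanoid robots lack expressive improvisational capabilities and are confined to predefined motions or sparse commands. The authors propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection.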
Reference score (independently computed attention measure): 52.83779852397341
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive performers capable of reacting to audio.
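Abstract (reference translation): Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities and are confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as an implicit style signal and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully turning robots into responsive performers that can react to audio.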


Related papers