Fugu-MT Paper Translation (Summary): MotionVLA: Vision-Language-Action Model for Humanoid Motion

Paper summary: MotionVLA: Vision-Language-Action Model for Humanoid Motion

arxiv url: http://arxiv.org/abs/2606.15142v1
Date: Sat, 13 Jun 2026 06:10:48 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:32.87359
Title: MotionVLA: Vision-Language-Action Model for Humanoid Motion
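Title (reference translation): MotionVLA: Vision-Language-Action Model for Humanoid Motion
Authors: Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang
Abstract summary: We propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams. We also present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence.
Reference score (independently computed attention score): 54.785960777274276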
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
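Abstract (reference translation): Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. A frequency-domain analysis of human motion data shows the mismatch: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy. A second challenge is adapting a standard autoregressive model so that it effectively models high-frequency physical signals in motion sequences. We therefore propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. We further present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, even with a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.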
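The abstract's energy argument (a few low-order DCT coefficients hold most of the position energy but much less of the velocity energy) can be illustrated with a small sketch. This is not the paper's analysis code: the toy trajectory, the finite-difference velocity, and the helper names `dct_ortho` and `energy_fraction` are all assumptions made for illustration, and the numbers it produces are unrelated to the 93% and 37% figures reported for real motion data.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def energy_fraction(x, keep):
    """Fraction of signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct_ortho(x)
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 1.0  # an all-zero signal is trivially fully captured
    return sum(c * c for c in coeffs[:keep]) / total

# Hypothetical toy joint trajectory (not the paper's data):
# one slow sway cycle plus a small, fast jitter component.
n = 64
position = [
    math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
# Velocity approximated by first differences; differencing amplifies
# high frequencies, so the jitter carries a larger share of the energy.
velocity = [position[t + 1] - position[t] for t in range(n - 1)]

pos_frac = energy_fraction(position, 5)
vel_frac = energy_fraction(velocity, 5)
print(f"position energy in 5 coefficients: {pos_frac:.3f}")
print(f"velocity energy in 5 coefficients: {vel_frac:.3f}")
```

Because differentiation scales each frequency component by roughly its frequency, the same truncation keeps a smaller share of the velocity signal than of the position signal, which is the kind of mismatch the abstract gives as motivation for separate Base and physical streams.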

Related papers