Fugu-MT Paper Translation (Summary): OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Paper summary: OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

arxiv url: http://arxiv.org/abs/2510.19789v1
Date: Wed, 22 Oct 2025 17:25:33 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:16.228299
Title: OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
Title (reference translation): OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
Authors: Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang, Wen Li, Lixin Duan, Qiang Xu
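Abstract summary: This paper introduces OmniMotion-X, a versatile framework for whole-body motion generation. OmniMotion-X efficiently supports diverse multimodal tasks such as text-to-motion, music-to-dance, and speech-to-gesture. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
Reference score (independently computed attention score): 52.579531290307926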
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
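Abstract (reference translation): This paper proposes OmniMotion-X, a versatile multimodal framework for whole-body motion generation that uses an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, using reference motion as a novel conditioning signal substantially improves the consistency of generated content, style, and temporal dynamics, which is crucial for realistic animation. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured, hierarchical captions that capture both low-level actions and high-level semantics. OmniMotion-X significantly surpasses existing methods, achieves state-of-the-art performance across multiple multimodal tasks, and enables interactive generation of realistic, coherent, and controllable long-duration motions.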
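
Illustrative sketch (not from the paper): the abstract lists motion prediction and in-betweening among the temporal control scenarios but does not say how they are encoded. One common way to express such tasks is a per-frame mask that marks which frames are given and which must be generated. The helper below assumes that representation; the function name `known_frame_mask`, the task labels, and the parameters are all hypothetical and chosen only for illustration.

```python
from typing import List


def known_frame_mask(task: str, num_frames: int, num_known: int) -> List[bool]:
    """Return a per-frame mask: True = frame is given, False = frame to generate.

    Hypothetical helper for illustration only; the paper's actual
    conditioning scheme is not described in the abstract.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_known < 0:
        raise ValueError("num_known must be non-negative")

    if task == "prediction":
        # Motion prediction: the first num_known frames are observed,
        # everything after them is generated.
        if num_known > num_frames:
            raise ValueError("num_known exceeds num_frames")
        return [i < num_known for i in range(num_frames)]

    if task == "in_betweening":
        # In-betweening: num_known frames at each end are observed,
        # the frames between them are generated.
        if 2 * num_known > num_frames:
            raise ValueError("known frames at both ends exceed num_frames")
        return [
            i < num_known or i >= num_frames - num_known
            for i in range(num_frames)
        ]

    raise ValueError(f"unknown task: {task}")
```

Under this assumption, a 3-second clip at the dataset's 30 fps would have `num_frames=90`, and `known_frame_mask("in_betweening", 90, 15)` would mark the first and last half second as given and the middle 60 frames as the part to synthesize.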
