Fugu-MT Paper Translation (Summary): MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Paper summary: MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

arxiv url: http://arxiv.org/abs/2407.21136v3
Date: Sun, 25 Aug 2024 07:35:04 GMT
Status: Translation complete
Last updated in system: 2024-08-27 20:50:26.519350
Title: MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls
Title (reference translation): MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls
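Authors: Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu
Abstract summary: We propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy that starts with a first stage of text-to-motion semantic pre-training. We introduce MC-Bench, a multimodal whole-body motion generation benchmark based on the unified SMPL-X format.
Reference score (independently computed attention score): 30.487510829107908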
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.
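Abstract (reference translation): Whole-body multimodal motion generation, controlled by text, speech, or music, has many applications, including video generation and character animation. However, using a unified model to perform various generation tasks with different condition modalities poses two main challenges: motion distribution drift across tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). In addition, inconsistent motion formats across tasks and datasets hinder effective training for multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework adopts a coarse-to-fine training strategy, beginning with a first stage of text-to-motion semantic pre-training and followed by a second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.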

Related papers

GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation [19.2804620329011]
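Generative Pretrained Multi-path Motion Model (GenM$^3$) is a framework for learning unified motion representations. To enable large-scale training, it integrates and unifies 11 high-quality motion datasets. GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, outperforming prior methods by a large margin.
Paper reference translation (metadata) (2025-03-19T05:56:52Z)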
PackDiT: Joint Human Motion and Text Generation via Mutual Prompting [22.53146582495341]
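PackDiT is the first diffusion-based generative model capable of performing various tasks simultaneously. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106. Our experiments further show that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
Paper reference translation (metadata) (2025-01-27T22:51:45Z)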
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a Large Motion-Language Model (LMLM). It supports multimodal control through Large Language Models (LLMs). It is highly adaptable to the challenging 3D whole-body motion generation task.
Paper reference translation (metadata) (2024-10-29T05:25:34Z)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
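Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent uses an open-source pre-trained language model to develop MotionLLM, a generative agent that bridges the gap between motion and text.
Paper reference translation (metadata) (2024-05-27T09:57:51Z)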
M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation [78.77004913030285]
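M$^3$GPT is an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for motion comprehension and generation. We employ discrete vector quantization for multimodal conditional signals such as text, music, and motion/dance, enabling seamless integration into a large language model. M$^3$GPT learns to model the connections and synergies among various motion-related tasks.
Paper reference translation (metadata) (2024-05-25T15:21:59Z)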
Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
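Large Motion Model (LMM) is a motion-centric multimodal framework that unifies mainstream motion generation tasks into a generalist model. LMM tackles these challenges from three principled aspects.
Paper reference translation (metadata) (2024-04-01T17:55:11Z)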
Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
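This paper introduces Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrating scene conditions. Our design significantly improves video quality, motion precision, and semantic coherence.
Paper reference translation (metadata) (2024-03-15T10:36:24Z)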
MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations [25.630268570049708]
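MoConVQ is a novel unified framework for physics-based motion control that leverages scalable discrete representations. Our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples.
Paper reference translation (metadata) (2023-10-16T09:09:02Z)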
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
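We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
Paper reference translation (metadata) (2023-09-04T05:43:48Z)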

The related papers list is generated automatically from the titles and abstracts of papers on this site.
