Fugu-MT paper translation (abstract): MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

Paper abstract: MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

arxiv url: http://arxiv.org/abs/2605.18956v1
Date: Mon, 18 May 2026 18:00:04 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:08.89634
Title: MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation
Title (reference translation): MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation
Authors: Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu
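Abstract summary: MotionMERGE is a unified framework that bridges the granularity gap in motion-language models. First, it pioneers the study of fine-grained language-guided motion control, including detailed understanding and localized editing. Second, it designs Reasoning-Aware Granularity-Synergy pre-training, a new strategy that applies joint supervision across granularities. Third, it curates MotionFineEdit, a large-scale dataset with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations.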
Reference score (independently computed attention score): 66.66098171359995
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.
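Abstract (reference translation): Recent motion-language models unify tasks such as comprehension and generation but operate at a coarse granularity. This stems from fundamental issues in both the model and the data: the model cannot focus on the localized patterns of motion, and the training data lacks fine-grained supervision. To address this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, by explicitly modeling motion at the part and temporal levels within a single LLM, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, giving the model robust priors for more precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a new strategy that jointly supervises cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate MotionMERGE's capability for more precise motion generation, understanding, and editing, and show compelling zero-shot generalization to other complex motion tasks. This work is an important step toward models that interact with motion at finer granularity and with human-like reasoning.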

Related papers