Fugu-MT Paper Translation (Abstract): UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

Paper summary: UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

arxiv url: http://arxiv.org/abs/2603.15975v1
Date: Mon, 16 Mar 2026 22:44:52 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.024352
Title: UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors
Authors: Xiaoyan Cong, Zekun Li, Zhiyang Dou, Hongyu Li, Omid Taheri, Chuan Guo, Abhay Mittal, Sizhe An, Taku Komura, Wojciech Matusik, Michael J. Black, Srinath Sridhar
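Abstract summary: UMO is a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations. Specifically, it introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone. UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks.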
Reference score (attention score computed independently by this site): 78.85130555487432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: https://oliver-cong02.github.io/UMO.github.io/
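The per-frame formulation described in the abstract can be illustrated with a small sketch. The abstract states only that there are three learnable frame-level meta-operation embeddings and a lightweight temporal fusion step; it does not name the operations or specify the fusion layer. The operation names (`KEEP`, `GENERATE`, `EDIT`), the additive fusion rule, and every function name below are therefore hypothetical, chosen only to show how a task such as temporal inpainting might be written as a sequence of per-frame operations.

```python
import random

# Hypothetical names for the three per-frame meta-operations. The abstract
# says there are three but does not name them.
KEEP, GENERATE, EDIT = 0, 1, 2
NUM_OPS = 3


def per_frame_ops(num_frames, spans, default=GENERATE):
    """Build one op id per frame from (start, end, op) spans; end is exclusive."""
    ops = [default] * num_frames
    for start, end, op in spans:
        for t in range(max(start, 0), min(end, num_frames)):
            ops[t] = op
    return ops


def init_op_embeddings(dim, seed=0):
    """Stand-in for learnable embeddings: one small random vector per op."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(NUM_OPS)]


def fuse(noisy_tokens, context_tokens, ops, op_embeddings):
    """Additive per-frame fusion (an assumption, not the paper's layer).

    Every frame token receives its op embedding. Context features are
    injected only on frames whose op is not GENERATE.
    """
    fused = []
    for x, c, op in zip(noisy_tokens, context_tokens, ops):
        e = op_embeddings[op]
        use_ctx = 0.0 if op == GENERATE else 1.0
        fused.append([xi + ei + use_ctx * ci for xi, ei, ci in zip(x, e, c)])
    return fused


# Temporal inpainting over 8 frames: keep the first two and last two frames,
# generate the four frames in between.
dim = 4
ops = per_frame_ops(8, [(0, 2, KEEP), (6, 8, KEEP)])
emb = init_op_embeddings(dim)
noisy = [[0.0] * dim for _ in range(8)]
context = [[1.0] * dim for _ in range(8)]
fused = fuse(noisy, context, ops, emb)
print(ops)
```

Under these assumptions, a different task would change only the `spans` passed to `per_frame_ops`, which is the sense in which one model could cover several tasks. How the actual model injects the fused cues into the pretrained DiT backbone is not described in the abstract.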
