Fugu-MT paper translation (abstract): LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

arxiv url: http://arxiv.org/abs/2602.12370v1
Date: Thu, 12 Feb 2026 20:02:21 GMT
Status: translation complete
Last updated in system: 2026-02-16 23:37:53.731024
Title: LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Title (reference translation): LLaMo: Scaling pretrained language models for unified motion understanding and generation with continuous autoregressive tokens
Authors: Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal,
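Abstract summary: LLaMo is a framework that extends pretrained large language models through a modality-specific Mixture-of-Transformers architecture. It encodes human motion into a causal continuous latent space and maintains the next-token prediction paradigm in a decoder-only backbone. Experiments show that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings.
Reference score (attention score computed independently by this site): 19.167250154665812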
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.
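Abstract (reference translation): Recent progress in large models has brought major advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, and because the available text-motion pairs are limited in scale, this can cause catastrophic forgetting of linguistic capabilities. In addition, prior methods typically convert motion into discrete representations via quantization in order to integrate it with language models, which introduces substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. Human motion is encoded into a causal continuous latent space, and a lightweight flow-matching head maintains the next-token prediction paradigm in the decoder-only backbone, enabling real-time (>30 FPS) streaming motion generation. Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, we demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step toward a general unified motion-language large model.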
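
The abstract describes keeping next-token prediction over continuous motion latents by attaching a flow-matching head to the decoder-only backbone. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: `toy_backbone` and `velocity_head` are hypothetical stand-ins (the real ones would be learned networks), the straight-line probability path and Euler integration are one common flow-matching choice that the abstract does not confirm, and the latent dimension and step count are arbitrary.

```python
import random

LATENT_DIM = 3  # arbitrary toy size; the paper's latent dimension is not given here


def toy_backbone(history):
    """Hypothetical stand-in for the decoder-only backbone.

    Maps the latents generated so far to a conditioning vector. This toy
    simply asks for "previous latent + 0.1" so the result is easy to check.
    """
    if not history:
        return [0.0] * LATENT_DIM
    return [v + 0.1 for v in history[-1]]


def velocity_head(x, t, context):
    """Hypothetical stand-in for a learned flow-matching head.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that carries x_t to x1 is (x1 - x_t) / (1 - t). Here `context` plays
    the role of x1; a trained head would predict the velocity instead.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, context)]


def sample_latent(context, rng, num_steps=8):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (latent)."""
    x = [rng.gauss(0.0, 1.0) for _ in context]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = velocity_head(x, t, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate_motion(num_tokens, seed=0):
    """Autoregressive loop: each new latent is conditioned on earlier ones."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_tokens):
        context = toy_backbone(history)
        history.append(sample_latent(context, rng))  # could be streamed out here
    return history


motion = generate_motion(3)
```

Because each latent is finished before the next one is requested, latents can be decoded and emitted one at a time, which is the property that makes streaming generation possible in this kind of design.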

Related papers

DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding [25.254783224309488]
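We propose DiMo, a discrete diffusion-style framework that extends masked modeling to text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode it sequentially, DiMo performs iterative masked-token refinement. Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding.
Paper reference translation (metadata) (2026-02-04T04:01:02Z)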
Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
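Bridge is a pure autoregressive unified MLLM that augments pretrained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
Paper reference translation (metadata) (2025-10-02T00:40:02Z)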
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
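We propose a new paradigm for world modeling, Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, it enables fine-grained alignment between linguistic concepts and visual representations of robot actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose intelligence.
Paper reference translation (metadata) (2025-08-28T14:31:48Z)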
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities [36.42160163142448]
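MG-MotionLLM is a unified motion-language model for multi-granular motion comprehension and generation. We propose a comprehensive multi-granularity training scheme that incorporates novel auxiliary tasks. MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks.
Paper reference translation (metadata) (2025-04-03T10:53:41Z)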
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
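We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. The code and models will be released.
Paper reference translation (metadata) (2024-12-12T18:59:26Z)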
MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks [30.333659816277823]
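We present MoTe, a unified multimodal model that can handle diverse tasks by simultaneously learning the marginal, conditional, and joint distributions of motion and text. MoTe consists of three components: a Motion Encoder-Decoder (MED), a Text Encoder-Decoder (TED), and a Motion-Text Diffusion Model (MTDM).
Paper reference translation (metadata) (2024-11-29T15:48:24Z)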
VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension [26.172040706657235]
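We introduce VersatileMotion, a unified motion LLM that combines a novel motion tokenizer, which integrates VQ-VAE with flow matching, and an autoregressive transformer backbone. VersatileMotion is the first method to handle single-agent and multi-agent motion in a single framework, and it achieves state-of-the-art performance on seven tasks.
Paper reference translation (metadata) (2024-11-26T11:28:01Z)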
Human Motion Instruction Tuning [37.3026760535819]
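This paper presents LLaMo, a framework for human motion instruction tuning. LLaMo retains motion in its native form for instruction tuning. By processing video and motion data together with text input, LLaMo enables flexible human-centric analysis.
Paper reference translation (metadata) (2024-11-25T14:38:43Z)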
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a Large Motion-Language Model (LMLM). It supports multimodal control by building on Large Language Models (LLMs). It adapts well to the challenging 3D holistic motion generation task.
Paper reference translation (metadata) (2024-10-29T05:25:34Z)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
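Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent uses an open-source pretrained language model to develop MotionLLM, a generative agent that bridges the gap between motion and text.
Paper reference translation (metadata) (2024-05-27T09:57:51Z)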
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
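We propose DiverseMotion, a new approach for synthesizing high-quality human motion from text descriptions. We show that DiverseMotion achieves state-of-the-art motion quality and competitive diversity.
Paper reference translation (metadata) (2023-09-04T05:43:48Z)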

The related-papers list is generated automatically from the titles and abstracts of the papers on this site.
