Fugu-MT Paper Translation (Abstract): LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model

Paper abstract: LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model

arxiv url: http://arxiv.org/abs/2509.25304v1
Date: Mon, 29 Sep 2025 17:58:28 GMT
Status: Translation complete
System update date: 2025-10-01 14:44:59.919879
Title: LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
Title (reference translation): LUMA: Low-Dimensional Unified Motion Alignment Using Dual-Path Anchoring for Text-to-Motion Diffusion Models
Authors: Haozhe Jia, Wenshuo Chen, Yuqi Lin, Yang Yang, Lei Wang, Mang Ning, Bowen Tian, Songning Lai, Nanqian Jia, Yifan Chen, Yutao Yue
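Abstract summary: This paper proposes a text-to-motion diffusion model that incorporates dual-path anchoring to strengthen semantic alignment. It reaches FID scores of 0.035 and 0.123 on HumanML3D and KIT-ML, respectively.
Reference score (independently calculated attention score): 18.564067196226436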
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4× compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
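Abstract (reference translation): Current diffusion-based models, typically built on U-Net architectures, show promising results on the text-to-motion generation task, but they still suffer from semantic misalignment and kinematic artifacts. Our analysis identifies severe gradient attenuation in the deep layers of the network as a key bottleneck that leads to insufficient learning of high-level features. To address this issue, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that incorporates dual-path anchoring. The first path incorporates a lightweight MoCLIP model, trained via contrastive learning without relying on external data, which provides semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to transition progressively from coarse alignment to fine-grained semantic refinement. Experimental results on HumanML3D and KIT-ML show that LUMA achieves FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4× compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.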
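
The abstract's second path draws alignment signals from low-frequency DCT components of the motion. As a minimal sketch of what that phrase means, and not the authors' implementation, the Python below applies an orthonormal DCT-II along the time axis of a motion sequence and keeps only the first few coefficients per feature channel. The function names `dct_ii` and `low_frequency_dct`, the `keep` parameter, and the list-of-frames layout are assumptions made for illustration; the abstract does not state how many coefficients LUMA retains or which DCT variant it uses.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        # Orthonormal scaling, so signal energy is preserved.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def low_frequency_dct(motion, keep):
    """Keep the first `keep` DCT coefficients along time for each channel.

    `motion` is a list of frames; each frame is a list of feature values
    (a hypothetical layout chosen for this sketch).
    Returns one list of `keep` coefficients per feature channel.
    """
    channels = zip(*motion)  # transpose: frames x features -> features x frames
    return [dct_ii(list(channel))[:keep] for channel in channels]


# Hypothetical toy motion: 4 frames, 2 feature channels.
toy_motion = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
toy_low = low_frequency_dct(toy_motion, keep=2)
```

Low-index coefficients capture the slowly varying trend of each channel, which is why truncating the DCT gives a compact, low-dimensional summary of a motion. In practice a library routine such as `scipy.fft.dct` would replace the quadratic-time loop.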

Related papers