Fugu-MT Paper Translation (Abstract): Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Paper abstract: Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

arxiv url: http://arxiv.org/abs/2605.14935v1
Date: Thu, 14 May 2026 15:09:49 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.900903
Title: Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control
Title (reference translation): Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control
Authors: Nhat Le, Daochang Liu, Anh Nguyen, Ajmal Mian
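Abstract summary: MSCoT is a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale.
Reference score (independently computed attention score): 51.92884966472683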
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference speed on HumanML3D.
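Abstract (reference translation): We propose MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on modules tailored to specific control signals or on multiple iterative denoising/token-prediction steps, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine manner. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution toward the control goals, enabling fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner adds continuous residuals to the discrete token embeddings and enables differentiable test-time refinement optimization, ensuring precise alignment with the control objectives. MSCoT can generate high-quality motions consistent with the control constraints while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks show that MSCoT achieves state-of-the-art controllable text-to-motion generation performance over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference on HumanML3D.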
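
The abstract describes discretizing motion into a multi-scale hierarchical representation, with coarse scales refined by finer ones. The sketch below is only a toy illustration of that general idea, not the authors' tokenizer: it assumes a 1-D signal standing in for motion, a fixed scalar codebook, average-pool downsampling, and residual quantization across scales. All function names (`encode_multiscale`, `decode_multiscale`, and the helpers) are hypothetical, and the learned token predictor, guidance strategy, and token refiner from the paper are not modeled.

```python
import numpy as np

def downsample(x, length):
    """Average-pool a 1-D sequence to `length` samples (len(x) must be divisible by length)."""
    return x.reshape(length, -1).mean(axis=1)

def upsample(x, length):
    """Nearest-neighbour upsample a 1-D sequence to `length` samples."""
    return np.repeat(x, length // len(x))

def quantize(x, codebook):
    """Return the index of the nearest codebook entry for every sample."""
    return np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)

def encode_multiscale(motion, scales, codebook):
    """Coarse-to-fine residual tokenization (toy assumption, not the paper's tokenizer).

    Each scale quantizes whatever the coarser scales failed to explain.
    """
    residual = motion.astype(float).copy()
    tokens = []
    for s in scales:
        idx = quantize(downsample(residual, s), codebook)
        tokens.append(idx)
        residual -= upsample(codebook[idx], len(motion))
    return tokens

def decode_multiscale(tokens, length, codebook):
    """Sum the upsampled codebook entries of every scale provided."""
    recon = np.zeros(length)
    for idx in tokens:
        recon += upsample(codebook[idx], length)
    return recon

# Hypothetical setup: 9 scalar codes (including 0) and five temporal scales.
codebook = np.linspace(-1.0, 1.0, 9)
scales = [1, 2, 4, 8, 16]
motion = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))

tokens = encode_multiscale(motion, scales, codebook)
coarse = decode_multiscale(tokens[:1], len(motion), codebook)
full = decode_multiscale(tokens, len(motion), codebook)

# Reconstruction error using only the coarsest scale vs. all scales.
print(np.mean((motion - coarse) ** 2), np.mean((motion - full) ** 2))
```

Because the codebook contains 0, adding a finer scale never increases the squared reconstruction error in this toy setup, which is the sense in which each scale refines the previous ones. In the setting the abstract describes, a model would predict the full token sequence of each scale in one step rather than encoding a known signal as done here.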

Related papers