Fugu-MT Paper Translation (Abstract): M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production

Paper abstract: M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production

arxiv url: http://arxiv.org/abs/2603.23617v1
Date: Tue, 24 Mar 2026 18:05:03 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:10.976275
Title: M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production
Title (reference translation): M3T: Discretization of Multi-Modal Motion Tokens for Sign Language Production
Authors: Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden
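Abstract summary: Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body.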
Reference score (independently calculated attention score): 56.171224102170015
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.
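Abstract (reference translation): Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. The standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T), M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, it reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.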
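
The abstract names Finite Scalar Quantization (FSQ) as the tokenizer but does not describe it. The sketch below shows the general FSQ idea only: each latent dimension is bounded and rounded to a small fixed set of integer values, so the codebook is the implicit grid of all value combinations and no learned codebook is needed. The function names, the level counts `(5, 5, 5)`, and the restriction to odd level counts are assumptions made for illustration; none of them come from the paper, and the straight-through gradient used in training is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent dimension d to one of levels[d] integer values.

    Illustrative sketch: supports odd level counts only, so the grid is
    symmetric around zero. No gradient handling is included.
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch supports odd level counts only")
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half   # squash each dimension into (-half, half)
    return np.round(bounded)      # integers in {-half, ..., +half}

def fsq_index(codes, levels):
    """Map quantized codes to a single token id via mixed-radix encoding."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = (codes + half).astype(np.int64)                # shift to {0, ..., L-1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # place values per dim
    return (digits * basis).sum(axis=-1)

levels = (5, 5, 5)  # hypothetical: 5 * 5 * 5 = 125 possible tokens
z = np.array([[0.0, 0.0, 0.0], [10.0, -10.0, 10.0]])
codes = fsq_quantize(z, levels)
token_ids = fsq_index(codes, levels)
```

In this sketch the vocabulary size is the product of the level counts, and every token id corresponds to a valid grid point by construction, which is the property usually cited when FSQ is contrasted with learned codebooks that can collapse.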


Related papers