Fugu-MT Paper Translation (Summary): Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

Paper summary: Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

arxiv url: http://arxiv.org/abs/2603.10648v2
Date: Thu, 12 Mar 2026 04:01:43 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.475071
Title: Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning
Title (reference translation): Decoder-free Masked Modeling for Efficient Skeleton Representation Learning
Authors: Jeonghyeok Do, Yun Chen, Geunhyuk Youk, Munchurl Kim
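Abstract summary: Skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoders (MAE). This paper proposes SLiM, a novel unified framework that harmonizes contrastive learning and masked modeling via a shared encoder. We show that SLiM consistently achieves state-of-the-art performance across all downstream protocols.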
Reference score (independently calculated attention score): 28.87004127483584
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling of representative learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
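Abstract (reference translation): The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Furthermore, MAE suffers from severe computational asymmetry: it benefits from efficient masking during pre-training but requires exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling and contrastive learning using a shared encoder. By omitting the reconstruction decoder, SLiM not only removes computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework for decoder-free masked modeling in representation learning. Importantly, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking together with skeletal-aware augmentations that ensure anatomical consistency across diverse temporal granularities. Extensive experiments show that SLiM consistently achieves state-of-the-art performance across all downstream protocols. In particular, the proposed method reduces inference computational cost by 7.89x compared to existing MAE methods.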

Related papers