Fugu-MT Paper Translation (Abstract): Foundation Model for Skeleton-Based Human Action Understanding

Paper abstract: Foundation Model for Skeleton-Based Human Action Understanding

arxiv url: http://arxiv.org/abs/2508.12586v1
Date: Mon, 18 Aug 2025 02:42:16 GMT
Status: Translation complete
System update date: 2025-08-19 14:49:10.948904
Title: Foundation Model for Skeleton-Based Human Action Understanding
Title (reference translation): A foundation model for skeleton-based human action understanding
Authors: Hongsong Wang, Wanjiang Weng, Junbo Wang, Fang Zhao, Guo-Sen Xie, Xin Geng, Liang Wang
Abstract summary: This paper proposes a Unified Skeleton-based Dense Representation Learning (USDRL) framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
Reference score (independently computed attention score): 56.89025287217221
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundation model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamics and spatial structure features. The MG-FD module jointly performs feature decorrelation across the temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
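Abstract (reference translation): Human action understanding is a foundational pillar of intelligent motion perception. Skeletons provide a representation for human modeling that is independent of modality and device, and skeleton-based action understanding can potentially be applied to humanoid robot control and interaction. However, existing work often lacks the scalability and generalization needed to handle diverse action understanding tasks, and no skeleton foundation model exists that can be adapted to a wide range of them. This paper proposes the Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundation model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module uses two parallel streams to learn temporal dynamics and spatial structure features. The MG-FD module performs feature decorrelation across the temporal, spatial, and instance domains, reducing dimensional redundancy and strengthening information extraction. The MPCT module uses both multi-view and multi-modal self-supervised consistency training: the former strengthens the learning of high-level semantics and mitigates the effect of low-level discrepancies, and the latter promotes the learning of informative multimodal features. Extensive experiments were carried out on 25 benchmarks spanning 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. The method substantially outperforms current state-of-the-art methods. The authors hope this work will widen the scope of research on skeleton-based action understanding and draw more attention to dense prediction tasks.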
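Note: the abstract says MG-FD performs feature decorrelation to reduce dimensional redundancy but gives no formula. The sketch below is only an illustration of the general idea, not the paper's loss: it assumes a common formulation in which the squared off-diagonal entries of the feature correlation matrix are penalized. The function name `decorrelation_penalty` and the toy inputs are hypothetical.

```python
import math


def decorrelation_penalty(features):
    """Sum of squared off-diagonal entries of the feature correlation matrix.

    `features` is a list of N samples, each a list of D floats.
    This is an assumed, generic decorrelation objective, not the MG-FD loss.
    """
    n = len(features)
    d = len(features[0])

    # Standardize each feature dimension to zero mean and unit variance.
    columns = []
    for j in range(d):
        col = [row[j] for row in features]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        columns.append([(x - mean) / std for x in col])

    # Penalize correlation between every pair of distinct dimensions.
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(a * b for a, b in zip(columns[i], columns[j])) / n
                penalty += corr ** 2
    return penalty


# Toy inputs: the second dimension duplicates the first (up to scale) ...
redundant = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# ... versus two dimensions that are uncorrelated across samples.
independent = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print(decorrelation_penalty(redundant))
print(decorrelation_penalty(independent))
```

The redundant input yields a penalty close to 2 (each of the two off-diagonal correlations is close to 1), while the uncorrelated input yields 0. How the paper arranges features for the temporal, spatial, and instance domains is not stated in the abstract, so that part is left out here.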

Related papers