Fugu-MT Paper Translation (Abstract): Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Paper summary: Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

arxiv url: http://arxiv.org/abs/2603.05963v1
Date: Fri, 06 Mar 2026 06:54:55 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.196858
Title: Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
Title (reference translation): Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
Authors: Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot
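Abstract summary: The paper introduces Skeleton-to-Image, a new representation that converts skeleton sequences into image-like data. This encoding makes powerful vision-pretrained models usable for self-supervised skeleton representation learning.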
Reference score (independently computed attention measure): 110.11712022072975
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.
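Abstract (reference translation): Recent advances in large-scale pretrained vision models have shown impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, because of fundamental differences in data format, applying them directly to human skeleton data remains difficult. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing new model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a new representation that converts skeleton sequences into image-like data. This encoding enables the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. Existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, but overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation provides a unified image-like format that naturally accommodates heterogeneous skeleton data. Large-scale experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of the self-supervised skeleton representation learning method.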
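
The abstract describes S2I only at a high level: joints are partitioned and arranged by body part, and the result is resized to standard image dimensions. The following is a minimal sketch of that general idea, not the authors' implementation. The function name `skeleton_to_image`, the layout (rows are joints, columns are frames, channels are x/y/z), the min-max scaling, the nearest-neighbour resize, and the toy part grouping are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Frame = Sequence[Joint]              # one pose: a sequence of joints
Pixel = Tuple[int, ...]              # one 3-channel pixel with values in 0..255


def skeleton_to_image(
    frames: Sequence[Frame],
    parts: Sequence[Sequence[int]],
    size: Tuple[int, int] = (224, 224),
) -> List[List[Pixel]]:
    """Encode a skeleton clip as an image-like grid (hypothetical layout)."""
    # 1. Arrange joints so that joints of the same body part sit in adjacent rows.
    order = [j for part in parts for j in part]
    if not frames or not order:
        raise ValueError("need at least one frame and one joint")

    # 2. Min-max scale each coordinate axis over the whole clip to 0..255.
    lo = [min(f[j][c] for f in frames for j in order) for c in range(3)]
    hi = [max(f[j][c] for f in frames for j in order) for c in range(3)]

    def to_byte(value: float, c: int) -> int:
        span = hi[c] - lo[c]
        return 0 if span == 0 else round(255 * (value - lo[c]) / span)

    grid = [
        [tuple(to_byte(f[j][c], c) for c in range(3)) for f in frames]
        for j in order
    ]

    # 3. Nearest-neighbour resize to the fixed input size a vision model expects.
    height, width = size
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[y * rows // height][x * cols // width] for x in range(width)]
        for y in range(height)
    ]


# Toy clip: 2 frames, 4 joints. The part grouping is hypothetical.
clip = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0)],
]
image = skeleton_to_image(clip, parts=[[0, 2], [1, 3]], size=(8, 8))

# A clip in a different skeleton format (3 joints) maps to the same grid size.
other = skeleton_to_image([[(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (2.0, 1.0, 0.0)]],
                          parts=[[0], [1, 2]], size=(8, 8))
```

Because the final step resizes every grid to the same `size`, clips with different joint counts or lengths produce inputs of identical shape, which is one plausible reading of the abstract's claim that an image-like format accommodates heterogeneous skeleton data.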
