Fugu-MT Paper Translation (Summary): Towards Universal Skeleton-Based Action Recognition

Paper summary: Towards Universal Skeleton-Based Action Recognition

arxiv url: http://arxiv.org/abs/2604.17013v1
Date: Sat, 18 Apr 2026 14:50:46 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.287994
Title: Towards Universal Skeleton-Based Action Recognition
Authors: Jidong Kuang, Hongsong Wang, Jie Gui
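Abstract summary: This work studies the problem of heterogeneous skeleton-based action recognition with open vocabularies. It proposes a Transformer-based model with three key components: a unified skeleton representation, a motion encoder for skeletons, and multi-grained motion-text alignment. Experiments on popular benchmarks with heterogeneous skeleton data demonstrate the effectiveness and performance of the proposed method.
Reference score (independently computed attention metric): 26.447920160010515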
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at https://github.com/jidongkuang/Universal-Skeleton.
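The abstract describes contrastive motion-text alignment at several levels but does not state the loss itself. The following is a minimal sketch of one common choice, a symmetric InfoNCE loss between motion and text embeddings, applied at a global level and per stream. The function names, the temperature value, the equal weighting of the terms, and the two-stream input layout are all assumptions made for illustration; they are not taken from the paper or its repository, and the fine-grained level is omitted because the abstract gives no detail about it.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    # Row i of motion_emb is assumed to be paired with row i of text_emb.
    # temperature=0.07 is an illustrative default, not a value from the paper.
    m = l2_normalize(motion_emb)
    t = l2_normalize(text_emb)
    logits = (m @ t.T) / temperature
    idx = np.arange(logits.shape[0])
    loss_motion_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_motion = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_motion_to_text + loss_text_to_motion)

def multi_level_alignment_loss(stream_a, stream_b, text_emb):
    # Hypothetical combination: a global term on the averaged streams plus
    # one term per stream, all weighted equally (an assumption).
    global_emb = 0.5 * (stream_a + stream_b)
    return (
        symmetric_info_nce(global_emb, text_emb)
        + symmetric_info_nce(stream_a, text_emb)
        + symmetric_info_nce(stream_b, text_emb)
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))
    stream_a = text + 0.01 * rng.normal(size=(8, 16))
    stream_b = text + 0.01 * rng.normal(size=(8, 16))
    print(multi_level_alignment_loss(stream_a, stream_b, text))
```

In this sketch, correctly paired motion and text embeddings yield a small loss, while shuffling the pairing raises it, which is the behavior a contrastive alignment objective is meant to have.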

Related papers