Fugu-MT Paper Translation (Abstract): Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies

Paper summary: Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies

arxiv url: http://arxiv.org/abs/2605.24302v1
Date: Sat, 23 May 2026 00:13:27 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:17.859038
Title: Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies
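Title (reference translation): Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies
Authors: Juan Ignacio Bustos Gorostegui, Maria Elena Buemi
Abstract summary: This paper proposes a cross-modal architecture that combines RGB video and temporal hand skeleton data. The architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation.
Reference score (independently computed attention score): 0.0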
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
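Abstract (reference translation): Egocentric action recognition is a difficult task because of erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data in a unified Mamba-based framework, taking advantage of the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion (Naive, Average, Weighted, Context-based). These strategies differ in how the pretrained unimodal CLS tokens, which act as information sinks that concentrate learned representations, are used to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy performs best, with gains over the VideoMamba baseline of more than 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration.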
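
The abstract names four CLS token mixing strategies but does not define them, so the following is only a rough sketch of what two of them might look like. It assumes that "Average" means an element-wise mean of the two pretrained unimodal CLS tokens and that "Weighted" means a convex combination of them. The function names, the `alpha` parameter, and the toy vectors are hypothetical and are not taken from the paper. The Naive and Context-based strategies are left out because the abstract gives no detail on them.

```python
from typing import List

Vector = List[float]


def _check_same_dim(cls_rgb: Vector, cls_skeleton: Vector) -> None:
    # Both unimodal CLS tokens are assumed to share one embedding dimension.
    if len(cls_rgb) != len(cls_skeleton):
        raise ValueError("CLS tokens must have the same embedding dimension")


def average_mix(cls_rgb: Vector, cls_skeleton: Vector) -> Vector:
    """Assumed 'Average' strategy: element-wise mean of the two CLS tokens."""
    _check_same_dim(cls_rgb, cls_skeleton)
    return [(a + b) / 2.0 for a, b in zip(cls_rgb, cls_skeleton)]


def weighted_mix(cls_rgb: Vector, cls_skeleton: Vector, alpha: float) -> Vector:
    """Assumed 'Weighted' strategy: alpha * RGB + (1 - alpha) * skeleton.

    alpha is a hypothetical mixing weight; in a real model it could be a
    learned parameter rather than a fixed constant.
    """
    _check_same_dim(cls_rgb, cls_skeleton)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(cls_rgb, cls_skeleton)]


# Toy 4-dimensional CLS tokens standing in for the pretrained unimodal tokens.
cls_rgb = [1.0, 2.0, 3.0, 4.0]
cls_skeleton = [3.0, 2.0, 1.0, 0.0]

# The mixed vector would initialize the CLS token used for classification.
mixed_avg = average_mix(cls_rgb, cls_skeleton)
mixed_weighted = weighted_mix(cls_rgb, cls_skeleton, alpha=0.75)
```

Under these assumptions the Average mix is the special case of the Weighted mix with `alpha = 0.5`. In both cases the result only initializes the fused CLS token and would still be updated by the fusion module during training.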


Related papers