Fugu-MT Paper Translation (Abstract): STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Paper overview: STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

arxiv url: http://arxiv.org/abs/2605.13202v1
Date: Wed, 13 May 2026 08:54:38 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.929176
Title: STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
Authors: Hongli Liu, Yu Wang, Shengjie Zhao
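Abstract summary: Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches suffer from semantic-temporal misalignment. This paper proposes Semantic-Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component.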
Reference score (independently computed attention score): 23.546777614096424
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from two problems. The first is semantic-temporal misalignment: static textual prompts fail to capture decisive visual cues that appear sparsely across sequences. The second is inadequate modeling of multi-scale temporal dynamics: short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic-Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component, which bridges the semantic and temporal gaps and transfers the sequence modeling capability of Mamba to FSAR. The semantic-alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves gains of up to 8.1% and 6.7% on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.
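To make "frame-level cross-modal alignment" and "prototypes" concrete, the following is a minimal toy sketch, not the authors' TSA or STPR modules. It assumes frame and text embeddings are already available as plain vectors, weights each frame by its cosine similarity to a class text embedding, and classifies a query by its nearest class prototype. All function names, the temperature value, and the 2-D toy vectors are hypothetical; the real method uses learned attention and Mamba blocks, which are not modeled here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_pool(frames, text, temperature=0.1):
    """Weight each frame by its similarity to the text embedding, then sum."""
    weights = softmax([cosine(f, text) / temperature for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

def build_prototypes(support, text_embeddings):
    """Average the text-pooled support videos of each class into one prototype."""
    prototypes = {}
    for label, videos in support.items():
        pooled = [text_guided_pool(v, text_embeddings[label])[0] for v in videos]
        dim = len(pooled[0])
        prototypes[label] = [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]
    return prototypes

def classify(query_frames, prototypes, text_embeddings):
    """Return the class whose prototype best matches the text-pooled query."""
    scores = {}
    for label, proto in prototypes.items():
        pooled, _ = text_guided_pool(query_frames, text_embeddings[label])
        scores[label] = cosine(pooled, proto)
    return max(scores, key=scores.get), scores

# Toy 1-shot, 2-way episode with hypothetical 2-D embeddings.
text = {"push": [1.0, 0.0], "pull": [0.0, 1.0]}
support = {
    "push": [[[0.9, 0.1], [0.2, 0.2]]],
    "pull": [[[0.1, 0.9], [0.2, 0.2]]],
}
prototypes = build_prototypes(support, text)

# Only the middle frame of the query carries the decisive cue.
query = [[0.2, 0.2], [0.8, 0.1], [0.2, 0.2]]
label, scores = classify(query, prototypes, text)
_, weights = text_guided_pool(query, text["push"])
print(label, weights)
```

In this toy episode the query is assigned to "push", and the middle frame receives the largest weight. This mirrors, in a very simplified form, the abstract's point that decisive cues appear sparsely across a sequence and that text guidance can help locate them.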


Related papers