Fugu-MT paper translation (abstract): RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

Paper overview: RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

arxiv url: http://arxiv.org/abs/2510.16444v1
Date: Sat, 18 Oct 2025 10:41:19 GMT
Status: translation complete
Last updated in system: 2025-10-25 00:56:38.998579
Title: RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
Title (reference translation): RefAtomNet++: Improving Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
Authors: Kunyu Peng, Di Wen, Jia Fu, Jiamin Wu, Kailun Yang, Junwei Zheng, Ruiping Liu, Yufan Chen, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
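Abstract summary: RefAVA++ comprises more than 2.9 million frames and more than 75.1k annotated persons. RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism. Experiments show that RefAtomNet++ establishes new state-of-the-art results.
Reference score (independently computed attention score): 86.47790050206306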
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.
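Abstract (reference translation): Referring Atomic Video Action Recognition (RAVAR) aims to recognize the fine-grained, atomic-level actions of a specific person of interest, conditioned on natural language descriptions. Unlike conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly important for interactive human action analysis in complex multi-person scenarios. In this work, we extend the previously introduced RefAVA dataset to RefAVA++. We benchmark this dataset with baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as with our earlier model, RefAtomNet. Although RefAtomNet incorporates agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, which leads to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome these limitations, we introduce RefAtomNet++, a new framework that advances cross-modal token aggregation by combining a multi-hierarchical semantic-aligned cross-attention mechanism with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both the partial-keyword and scene-attribute levels. In addition, we design a multi-hierarchical semantic-aligned cross-attention strategy that enables more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.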
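
The abstract says that scanning trajectories are built by selecting the nearest visual spatial tokens at each timestep. The following is a minimal sketch of that selection step only, not the authors' implementation. It assumes that "nearest" means highest cosine similarity to a text embedding (the abstract does not name the metric), and the function names `cosine` and `build_trajectory` as well as the toy 2-D vectors are hypothetical. The Mamba scan over the resulting trajectory is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_trajectory(visual_tokens, text_embedding):
    """Pick, for every timestep, the spatial token closest to the text embedding.

    visual_tokens: list over timesteps, each a list of spatial token vectors.
    text_embedding: one vector (e.g. a keyword or scene-attribute embedding).
    Returns a list of (timestep, token_index) pairs in temporal order.
    Assumption: closeness is measured by cosine similarity.
    """
    trajectory = []
    for t, frame_tokens in enumerate(visual_tokens):
        best = max(
            range(len(frame_tokens)),
            key=lambda i: cosine(frame_tokens[i], text_embedding),
        )
        trajectory.append((t, best))
    return trajectory


# Toy example: 3 timesteps, 2 spatial tokens each, 2-D features.
tokens = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 1], [-1, 0]],
]
keyword = [1, 0]
print(build_trajectory(tokens, keyword))
```

Under these assumptions, one such trajectory would be built per keyword or scene attribute, and each ordered token sequence would then be passed to a sequence model.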

Related papers