Fugu-MT Paper Translation (Abstract): A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

Paper overview: A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

arxiv url: http://arxiv.org/abs/2510.18705v2
Date: Thu, 23 Oct 2025 02:35:00 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.748152
Title: A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Title (reference translation): A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Authors: Peiqin Zhuang, Lei Bai, Yichao Wu, Ding Liang, Luping Zhou, Yali Wang, Wanli Ouyang
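Abstract summary: Action recognition is dominated by transformer-based methods thanks to their contextual aggregation capacity. This paper proposes integrating the effective motion modeling properties of the cost volume into existing transformers in a unified and neat way. The proposed method outperforms existing state-of-the-art methods, especially on motion-sensitive datasets.
Reference score (independently computed attention score): 87.12969639957441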
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 & V2. Our project is available at https://github.com/PeiqinZhuang/EMIM .
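Abstract (reference translation): In recent years, action recognition has been dominated by transformer-based methods thanks to their spatiotemporal contextual aggregation capacity. Although these methods have made substantial progress on scene-related datasets, they perform poorly on motion-sensitive datasets because they lack elaborate motion modeling designs. Meanwhile, the cost volume widely used in traditional action recognition closely resembles the affinity matrix defined in self-attention, yet comes with strong motion modeling capacity. We therefore propose the Explicit Motion Information Mining module (EMIM), which integrates these motion modeling properties into existing transformers in a unified and neat way. EMIM constructs the affinity matrix in a cost volume style: the set of key candidate tokens is sampled, in a sliding-window manner, from the neighboring area around each query in the next frame. The constructed affinity matrix is then used both to aggregate contextual information for appearance modeling and, after conversion into motion features, for motion modeling. We validate the motion modeling capacity of the method on four widely used datasets, where it outperforms existing state-of-the-art approaches, especially on the motion-sensitive Something-Something V1 and V2 datasets. The project is available at https://github.com/PeiqinZhuang/EMIM .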
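
The mechanism described in the abstract (a cost-volume-style affinity matrix whose keys come from a local window in the next frame, used both for appearance aggregation and for motion features) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `emim_sketch`, the `radius` parameter, zero padding at the borders, the scaling by the square root of the channel count, the softmax normalization, and the use of the expected displacement as the "motion feature" are all assumptions made for illustration. Learned query/key/value projections are omitted.

```python
import numpy as np


def emim_sketch(feats, radius=1):
    """Illustrative local affinity between each frame and the next one.

    feats: array of shape (T, H, W, C), one feature vector per token.
    Returns (affinity, appearance, motion) with shapes
    (T-1, H, W, K), (T-1, H, W, C), (T-1, H, W, 2), where K = (2*radius+1)**2.
    """
    T, H, W, C = feats.shape
    queries, keys = feats[:-1], feats[1:]  # frame t queries, frame t+1 keys

    # Zero-pad keys spatially so every query sees a full window (assumption:
    # padded positions simply take part in the softmax with a score of 0).
    keys_pad = np.pad(keys, ((0, 0), (radius, radius), (radius, radius), (0, 0)))

    # All (dy, dx) displacements inside the sliding window.
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]

    # Gather the key candidates for every query: shape (T-1, H, W, K, C).
    windows = np.stack(
        [keys_pad[:, radius + dy: radius + dy + H, radius + dx: radius + dx + W]
         for dy, dx in offsets],
        axis=3,
    )

    # Cost-volume-style scores: dot product of each query with its K candidates.
    scores = np.einsum("thwc,thwkc->thwk", queries, windows) / np.sqrt(C)

    # Softmax over the window gives the affinity matrix.
    scores = scores - scores.max(axis=-1, keepdims=True)
    affinity = np.exp(scores)
    affinity = affinity / affinity.sum(axis=-1, keepdims=True)

    # Appearance branch: aggregate context from the next frame.
    appearance = np.einsum("thwk,thwkc->thwc", affinity, windows)

    # Motion branch (assumed form): expected (dy, dx) displacement per query.
    motion = affinity @ np.asarray(offsets, dtype=float)
    return affinity, appearance, motion


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((3, 4, 5, 8))  # 3 frames, 4x5 tokens, 8 channels
    affinity, appearance, motion = emim_sketch(clip, radius=1)
    print(affinity.shape, appearance.shape, motion.shape)
```

In this sketch the same affinity tensor feeds both branches, which mirrors the abstract's statement that one constructed affinity matrix serves appearance and motion modeling. How the paper actually converts affinities into motion features is not stated in the abstract.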

Related papers