Fugu-MT Paper Translation (Abstract): TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

Paper overview: TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition

arxiv url: http://arxiv.org/abs/2604.11498v1
Date: Mon, 13 Apr 2026 14:03:58 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.582942
Title: TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition
Title (reference translation): TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition
Authors: Imtiaz Ul Hassan, Nik Bessis, Ardhendu Behera
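Abstract summary: We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones for FHAR using RGB only. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. We show that TAG-Head sets a new state of the art among RGB-only models and surpasses many recent multimodal approaches.
Reference score (independently computed attention score): 4.18721311473154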
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state of the art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
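Abstract (reference translation): Fine-grained human action recognition (FHAR) is difficult because visually similar actions differ only in subtle spatio-temporal cues. Many recent systems improve discriminability with additional modalities (e.g., pose, text, optical flow), but this increases both the annotation burden and the computational cost. This paper introduces TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. The pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully connected intra-frame edges, which resolve subtle appearance differences within a frame, and (ii) time-aligned temporal edges, which connect features at the same spatial location across frames and stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state of the art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design makes it easy to adopt in practical systems that favour RGB-only sensors, while delivering performance gains usually associated with heavier or multimodal models. The code will be released on GitHub.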
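
The two edge types named in the abstract can be illustrated with a small sketch of the graph topology. This is not the authors' code, which had not been released at the time of this listing. It assumes the backbone tokens form a grid of `num_frames` x `height` x `width` nodes, and it assumes temporal edges link only consecutive frames; the abstract does not say whether non-adjacent frames are also connected. The names `node_index` and `build_tag_edges` are hypothetical.

```python
from itertools import combinations


def node_index(t, h, w, height, width):
    """Flatten a (frame, row, column) token position into a single node id."""
    return (t * height + h) * width + w


def build_tag_edges(num_frames, height, width):
    """Return (intra_frame_edges, temporal_edges) as sets of undirected (i, j) pairs, i < j.

    Assumption (not stated in the abstract): temporal edges link consecutive frames only.
    """
    tokens_per_frame = height * width
    intra_frame_edges = set()
    temporal_edges = set()

    for t in range(num_frames):
        start = t * tokens_per_frame
        frame_nodes = range(start, start + tokens_per_frame)
        # (i) Fully connected intra-frame edges: every pair of tokens in the same frame.
        intra_frame_edges.update(combinations(frame_nodes, 2))

    for t in range(num_frames - 1):
        for h in range(height):
            for w in range(width):
                # (ii) Time-aligned edges: the same (h, w) location in frames t and t + 1.
                src = node_index(t, h, w, height, width)
                dst = node_index(t + 1, h, w, height, width)
                temporal_edges.add((src, dst))

    return intra_frame_edges, temporal_edges


intra, temporal = build_tag_edges(num_frames=4, height=2, width=2)
print(len(intra), len(temporal))  # prints "24 12"
```

Under these assumptions the intra-frame edge count grows quadratically with the number of tokens per frame, while the temporal edge count grows only linearly with the number of tokens. That is one plausible reading of how such a graph can keep its temporal connectivity sparse.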

Related papers