Fugu-MT Paper Translation (Abstract): $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

arxiv url: http://arxiv.org/abs/2604.26461v1
Date: Wed, 29 Apr 2026 09:17:03 GMT
Status: Translation complete
Last updated in system: 2026-04-30 15:59:36.32844
Title: $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding
Authors: Lingjie Zeng, Hailun Zhang, Xiwen Wang, Qijun Zhao
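Abstract summary: We propose PKS$^4$ (Parallel Kinematic Selective State Space Scanners) for video understanding. Our method converges in only $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs.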
Reference score (independently computed attention metric): 16.337339443094866
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
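Abstract (reference translation): Temporal modeling is a fundamental challenge in video understanding, especially as sequence lengths grow. Conventional video models depend on dense spatiotemporal attention and therefore incur quadratic computational cost on long videos. To avoid this cost, recent approaches adapt image models to video using Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. Inserting these modules deep in the network, however, causes prohibitive activation memory overhead during back-propagation. Recent efficient State Space Models (SSMs) offer linear complexity, but they disrupt 2D spatial relationships and depend on extensive masked pre-training to recover spatial awareness. To overcome these limitations, the authors propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). A standard 2D vision backbone is retained for spatial semantics, and a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning is inserted, which avoids temporal attention and multi-layer adapters. First, a Kinematic Prior Encoder extracts kinematic priors, capturing local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs that track the underlying kinematic states, adaptively modulating the update speed and the read-write strategy at each time step. Instead of a global scan, parallel scanners are deployed along the temporal dimension at each spatial location, which preserves spatial structure while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Notably, the method converges in only $20$ epochs and requires approximately $10\times$ less training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.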
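
The abstract gives no equations or code, so the following is only a rough, dependency-free Python sketch of the general idea of per-location temporal scanning driven by a motion cue. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names `frame_differences` and `parallel_temporal_scan`, the use of a single scalar feature per location, the use of a plain absolute frame difference as the "kinematic prior" (the paper also uses inter-frame correlations, which are omitted here), and the sigmoid-gated recurrence standing in for a selective SSM.

```python
import math


def frame_differences(frames):
    """Toy motion cue: absolute difference to the previous frame, per location.

    frames is a list of T frames, each an H x W grid (list of lists) of floats.
    The first frame is compared with itself, so its prior is all zeros.
    """
    priors = []
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame
        priors.append([[abs(v - pv) for v, pv in zip(row, prev_row)]
                       for row, prev_row in zip(frame, prev)])
    return priors


def parallel_temporal_scan(frames, priors, w_gate=1.0, b_gate=0.0):
    """Run one independent gated recurrence along time for every (i, j).

    state[i][j] <- retain * state[i][j] + (1 - retain) * x[i][j]
    retain = 1 / (1 + exp(w_gate * prior + b_gate))

    Assumed sign convention (not from the paper): more motion means less
    retention, i.e. a faster state update. No location ever reads another
    location's state, so the spatial layout is left untouched, and the
    total work is proportional to T * H * W (linear in the number of frames).
    """
    height, width = len(frames[0]), len(frames[0][0])
    state = [[0.0] * width for _ in range(height)]
    outputs = []
    for frame, prior in zip(frames, priors):
        new_state = []
        for i in range(height):
            row = []
            for j in range(width):
                retain = 1.0 / (1.0 + math.exp(w_gate * prior[i][j] + b_gate))
                row.append(retain * state[i][j] + (1.0 - retain) * frame[i][j])
            new_state.append(row)
        state = new_state
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    # Three constant 2 x 2 frames: no motion, so every gate is exactly 0.5.
    video = [[[2.0, 2.0], [2.0, 2.0]] for _ in range(3)]
    scanned = parallel_temporal_scan(video, frame_differences(video))
    print([frame[0][0] for frame in scanned])
```

In a real model the per-location loops would be batched on an accelerator and the gate would be a learned function of richer priors; the sketch only shows why scanning each location separately keeps the cost linear in the number of frames while leaving the 2D arrangement of features intact.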
