Fugu-MT Paper Translation (Abstract): State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

Paper overview: State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

arxiv url: http://arxiv.org/abs/2510.12160v1
Date: Tue, 14 Oct 2025 05:30:36 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.196845
Title: State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
Title (reference translation): State-space prompting by gathering and spreading spatio-temporal information for video understanding
Authors: Jiahuan Zhou, Kai Zhu, Zhenyu Cui, Zichen Liu, Xu Zou, Gang Hua
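Abstract summary: This paper proposes a State Space Prompting (SSP) method for video understanding. SSP combines intra-frame and inter-frame prompts to aggregate and propagate key spatio-temporal information in the video. SSP outperforms existing SOTA methods by 2.76% on average.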
Reference score (independently computed attention metric): 50.866929044215965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, pre-trained state space models have shown great potential for video classification: they sequentially compress the visual tokens in a video with linear complexity, improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning has been proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video. This limits both the effective propagation of spatial information within a frame and temporal information between frames in the state compression model, and the extraction of discriminative information. To tackle this issue, we propose a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatio-temporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate key spatial information within each frame, and an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuned parameters.
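Abstract (reference translation): Recently, pre-trained state space models have shown great potential for video classification: they sequentially compress the visual tokens in a video with linear complexity, improving the efficiency of video processing while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning has been proposed to achieve efficient downstream adaptation using only a small number of fine-tuned parameters. However, sequentially compressed visual prompt tokens cannot capture the spatial and temporal context of the video, which limits the effective propagation of spatial information within a frame and temporal information between frames in the state compression model, as well as the extraction of discriminative information. To address this, we propose State Space Prompting (SSP), a method for video understanding that combines intra-frame and inter-frame prompts to aggregate and propagate spatio-temporal information in the video. Specifically, the Intra-Frame Gathering (IFG) module is designed to aggregate key spatial information within each frame, and the Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, SSP effectively propagates discriminative information in the video in a complementary manner. Extensive experiments on four video benchmark datasets show that SSP outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuned parameters.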
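
As a rough illustration of what "sequentially compressing visual tokens with linear complexity" can mean, the sketch below runs a generic diagonal linear state-space recurrence over a flattened video token sequence. It is not the paper's SSP, IFG, or IFS implementation, which the abstract does not specify. The function names `flatten_video` and `ssm_scan`, the scalar tokens, and the parameters `a`, `b`, `c` are all assumptions made for this example.

```python
from typing import List, Sequence


def flatten_video(frames: Sequence[Sequence[float]]) -> List[float]:
    """Flatten per-frame token lists into one sequence, frame after frame.

    Hypothetical helper: real models use vector-valued patch tokens.
    """
    return [token for frame in frames for token in frame]


def ssm_scan(tokens: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Diagonal linear state-space recurrence over a token sequence.

    h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
    y_t    = sum_i c[i] * h_t[i]

    One pass over T tokens with an N-dimensional state costs O(T * N),
    i.e. linear in the sequence length.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")
    state = [0.0] * len(a)
    outputs = []
    for x in tokens:
        # Update every state channel from the previous state and the new token.
        state = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, state)]
        # Read out a single scalar per token from the compressed state.
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# Two frames with two scalar tokens each (toy values, assumed for illustration).
sequence = flatten_video([[1.0, 0.0], [0.0, 0.0]])
print(ssm_scan(sequence, a=[0.5], b=[1.0], c=[1.0]))
```

In this toy setting, the contribution of the first token reaches later tokens only through the compressed state and shrinks at every step. That is one way to picture why the abstract argues that spatial and temporal context needs to be gathered and spread explicitly.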

Related papers