Fugu-MT Paper Translation (Abstract): Multi-Level LVLM Guidance for Untrimmed Video Action Recognition

Paper summary: Multi-Level LVLM Guidance for Untrimmed Video Action Recognition

arxiv url: http://arxiv.org/abs/2508.17442v1
Date: Sun, 24 Aug 2025 16:45:21 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.526932
Title: Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
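Title (reference translation): Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
Authors: Liyang Peng, Sihan Zhu, Yunjie Guo
Abstract summary: This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that bridges the gap between low-level visual features and high-level semantic information. In experiments on ActivityNet v1.3 and THUMOS14, ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14.
Reference score (independently computed attention level): 0.0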
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder's learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model's ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.
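The abstract names adaptive gating as the mechanism for fusing high-level semantic cues into the video features, but it gives no formula. As a minimal sketch under stated assumptions, one common form of gated fusion computes a per-dimension sigmoid gate from the concatenated video and text features and uses it to blend the two. The function name `gated_fusion`, the parameters `w_gate` and `b_gate`, and the blending rule are hypothetical illustrations, not ECVT's actual implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(video_feat, text_feat, w_gate, b_gate):
    """Blend a video feature with a text feature through a gate.

    Hypothetical sketch (not taken from the paper):
        gate  = sigmoid(W [video; text] + b)
        fused = gate * video + (1 - gate) * text

    video_feat, text_feat: lists of floats with the same length d
    w_gate: d rows, each a list of 2*d weights
    b_gate: list of d biases
    """
    if len(video_feat) != len(text_feat):
        raise ValueError("video and text features must have the same length")
    concat = list(video_feat) + list(text_feat)
    fused = []
    for i, (v, t) in enumerate(zip(video_feat, text_feat)):
        # One gate value per feature dimension, computed from both modalities.
        logit = sum(w * x for w, x in zip(w_gate[i], concat)) + b_gate[i]
        g = sigmoid(logit)
        fused.append(g * v + (1.0 - g) * t)
    return fused


# With zero weights and biases every gate is 0.5, so the result is the mean.
video = [1.0, 2.0]
text = [3.0, 4.0]
zero_w = [[0.0] * 4, [0.0] * 4]
print(gated_fusion(video, text, zero_w, [0.0, 0.0]))  # [2.0, 3.0]
```

In a trained model the gate weights would be learned, letting the network decide per dimension how much to trust the visual signal versus the LVLM-generated description.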
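Abstract (reference translation): Action recognition and localization in complex, untrimmed videos remain a serious challenge in computer vision, mainly because existing methods are limited in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that uses the advanced semantic understanding of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT adopts a dual-branch design consisting of a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter uses an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for the macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder's learning process through adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a loss function that incorporates semantic consistency and temporal calibration terms, ECVT significantly strengthens the model's ability to understand video temporal structure and event logic. Extensive experiments on the ActivityNet v1.3 and THUMOS14 datasets show that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.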
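The reported mAP@0.5 figure rests on temporal intersection-over-union (tIoU): a predicted segment counts as a match only if its tIoU with a ground-truth segment of the same class reaches the threshold, here 0.5. The sketch below shows that overlap test in isolation. The function names and the example segments are illustrative assumptions, and the full mAP computation (ranking by confidence, per-class averaging) is omitted.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth at the given tIoU."""
    return temporal_iou(pred, gt) >= threshold


# Illustrative segments (not from the paper): 4 s overlap over an 8 s union.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
print(is_match((2.0, 8.0), (4.0, 10.0)))      # True
print(is_match((0.0, 2.0), (3.0, 5.0)))       # False (no overlap)
```

Raising `threshold` makes the match criterion stricter, which is why a score averaged over several thresholds is typically lower than mAP at 0.5 alone.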
