Fugu-MT Paper Translation (Abstract): Zero-Shot Temporal Action Localization Through Textual Guidance

Paper overview: Zero-Shot Temporal Action Localization Through Textual Guidance

arxiv url: http://arxiv.org/abs/2605.22201v1
Date: Thu, 21 May 2026 09:05:27 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.179529
Title: Zero-Shot Temporal Action Localization Through Textual Guidance
Title (reference translation): Zero-Shot Temporal Action Localization via Text Guidance
Authors: Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci
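Abstract summary: Zero-shot temporal action localization (ZS-TAL) is the task of classifying and localizing actions in untrimmed videos. This paper proposes Textual Guidance for finer localization of actions in videos (TEGU), a new approach that compensates for the lack of supervision from training data.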
Reference score (independently computed attention metric): 57.40476559895395
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.
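Abstract (reference translation): Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos. Existing work uses vision and language models (VLMs), taking advantage of their strong zero-shot transfer capabilities. However, these models face evident challenges in fine-grained action classification, which makes it difficult to use them directly to distinguish between the presence and absence of an action. Most current ZS-TAL methods address these challenges by training models on large-scale video datasets, which requires annotated data. Recently, approaches that discard the use of labeled data have emerged as an alternative. This work proposes Textual Guidance for finer localization of actions in videos (TEGU), which compensates for the lack of supervision from training data by exploiting rich textual information obtained from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. Experiments on the THUMOS14 and ActivityNet-v1.3 datasets validate the effectiveness of the proposed method. The results show that, by exploiting rich textual information to improve action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.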
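
The task setting in the abstract (localizing actions in an untrimmed video without training) can be illustrated with a minimal sketch. This is not the TEGU method: it assumes that per-frame similarity scores between video frames and a text prompt have already been produced by some vision-language model, and it only shows how such scores could be turned into temporal segments. The function name `scores_to_segments`, the threshold, the frame rate, and the toy scores are all hypothetical.

```python
from typing import List, Tuple


def scores_to_segments(
    scores: List[float],
    threshold: float = 0.5,
    fps: float = 1.0,
) -> List[Tuple[float, float, float]]:
    """Group consecutive frames whose score exceeds `threshold` into segments.

    Returns (start_sec, end_sec, mean_score) tuples. Hypothetical helper for
    illustration only; it is not the localization procedure used by TEGU.
    """
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i  # open a new segment
        elif s <= threshold and start is not None:
            chunk = scores[start:i]  # close the segment just before frame i
            segments.append((start / fps, i / fps, sum(chunk) / len(chunk)))
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        chunk = scores[start:]
        segments.append((start / fps, len(scores) / fps, sum(chunk) / len(chunk)))
    return segments


# Toy, made-up frame-to-text similarity scores at 1 frame per second.
toy_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.75, 0.2]
segments = scores_to_segments(toy_scores, threshold=0.5, fps=1.0)
for start_sec, end_sec, mean_score in segments:
    print(f"action from {start_sec:.1f}s to {end_sec:.1f}s (mean score {mean_score:.2f})")
```

A fixed threshold is the simplest possible choice here. The abstract notes that VLMs struggle to separate the presence and absence of an action, which is the weakness a plain thresholding step like this one would inherit.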


Related papers