Fugu-MT Paper Translation (Abstract): ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

Paper summary: ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

arxiv url: http://arxiv.org/abs/2308.06009v1
Date: Fri, 11 Aug 2023 08:30:08 GMT
Status: Translation complete
Last updated in system: 2023-08-14 14:46:50.435789
Title: ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
Title (reference translation): ViGT: Proposal-free Video Grounding with a Learnable Token in a Transformer
Authors: Kun Li, Dan Guo, Meng Wang
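Abstract summary: The video grounding task aims to locate a queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods are caught up in complex interactions between video and query. This paper proposes a new boundary regression paradigm that performs regression token learning in a transformer.
Reference score (independently computed attention score): 28.227291816020646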
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Particularly, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context aggregation from video and query features. First, we employed a sharing feature encoder to project both video and query into a joint feature space before performing cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Furthermore, we concatenated a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we utilized the token [REG] to predict the target moment and visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performed well on three public datasets: ANet Captions, TACoS and YouCookII. Extensive ablation studies and qualitative analysis further validated the interpretability of ViGT.
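Abstract (reference translation): The video grounding (VG) task aims to locate a queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods are caught up in complex interactions between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. This paper proposes a new boundary regression paradigm that performs regression token learning in a transformer. Specifically, it presents Video Grounding Transformer (ViGT), a simple but effective proposal-free framework that predicts the temporal boundary from a learnable regression token rather than from multi-modal or cross-modal features. In ViGT, the learnable token offers two benefits. (1) The token is unrelated to the video or the query, so it avoids data bias toward the original video and query. (2) The token simultaneously aggregates global context from the video and query features. First, a shared feature encoder projects both video and query into a joint feature space, and cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) then highlights the discriminative features in each modality. Next, a learnable regression token [REG] is concatenated with the video and query features as the input to a vision-language transformer. Finally, the [REG] token is used to predict the target moment, while the visual features are used to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performs well on three public datasets: ANet Captions, TACoS, and YouCookII. Extensive ablation studies and qualitative analysis further validate the interpretability of ViGT.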
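
The pipeline described in the abstract (shared encoder, co-attention, [REG] token, boundary regression) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation: the function names, the random untrained weights, the single attention layer standing in for the vision-language transformer, the assumption that video and query features share one input dimension, and the center/width parameterization of the boundary are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def init_params(rng, d_in, d_model):
    """Hypothetical random parameters; a real model would learn these."""
    return {
        "w_shared": rng.normal(scale=0.1, size=(d_in, d_model)),  # shared encoder
        "reg_token": rng.normal(scale=0.1, size=(d_model,)),      # learnable [REG]
        "w_boundary": rng.normal(scale=0.1, size=(d_model, 2)),   # boundary head
        "w_foreground": rng.normal(scale=0.1, size=(d_model,)),   # per-clip head
    }

def vigt_like_forward(video, query, params):
    # 1) Shared encoder: the same projection maps both modalities
    #    into a joint feature space.
    v = np.tanh(video @ params["w_shared"])
    q = np.tanh(query @ params["w_shared"])

    # 2) Cross-modal co-attention with residual connections.
    v_att = v + attend(v, q)  # video-to-query attention
    q_att = q + attend(q, v)  # query-to-video attention

    # 3) Concatenate the [REG] token with the video and query features.
    tokens = np.concatenate([params["reg_token"][None, :], v_att, q_att], axis=0)

    # 4) One self-attention layer as a stand-in for the transformer.
    tokens = tokens + attend(tokens, tokens)
    reg_out = tokens[0]
    video_out = tokens[1:1 + len(video)]

    # 5) Regress the moment from [REG] only; score each timestamp
    #    as foreground from the visual tokens.
    center, width = sigmoid(reg_out @ params["w_boundary"])
    start = float(max(0.0, center - width / 2))
    end = float(min(1.0, center + width / 2))
    fg_prob = sigmoid(video_out @ params["w_foreground"])
    return (start, end), fg_prob

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, d_in=16, d_model=8)
    video = rng.normal(size=(20, 16))  # 20 clips of toy video features
    query = rng.normal(size=(5, 16))   # 5 words of toy query features
    (start, end), fg_prob = vigt_like_forward(video, query, params)
    print("normalized moment:", (start, end))
    print("foreground probabilities shape:", fg_prob.shape)
```

The boundary is read only from the [REG] position, which mirrors the abstract's point that the prediction comes from a token unrelated to either input, while the visual tokens feed a separate per-timestamp foreground score.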

Related papers