Fugu-MT Paper Translation (Abstract): Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

Paper overview: Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

arxiv url: http://arxiv.org/abs/2301.07868v1
Date: Thu, 19 Jan 2023 03:42:56 GMT
Status: Translation complete
Last updated in system: 2023-01-20 15:44:34.312072
Title: Multimodal Video Adapter for Parameter Efficient Video Text Retrieval
Title (reference translation): A Multimodal Video Adapter for Parameter-Efficient Video-Text Retrieval
Authors: Bowen Zhang, Xiaojie Jin, Weibo Gong, Kai Xu, Zhao Zhang, Peng Wang, Xiaohui Shen, Jiashi Feng
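Abstract summary: State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune a pre-trained model (e.g., CLIP) on specific datasets. This paper presents the first work on parameter-efficient VTR from a pre-trained model. It proposes a new method called Multimodal Video Adapter (MV-Adapter).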
Reference score (independently computed attention score): 81.88648509168962
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune the pre-trained model (e.g., CLIP) on specific datasets, which may incur substantial storage costs in practical applications, since a separate model must be stored for each task. To overcome this issue, we present the first work on performing parameter-efficient VTR from the pre-trained model, i.e., only a small number of parameters are tunable while the backbone is frozen. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter adopts bottleneck structures in both the video and text branches and introduces two novel components. The first is a Temporal Adaptation Module employed in the video branch to inject global and local temporal contexts. We also learn weight calibrations to adapt to the dynamic variations across frames. The second is a Cross-Modal Interaction Module that generates weights for the video/text branches through a shared parameter space, for better alignment between modalities. Thanks to the above innovations, MV-Adapter can achieve on-par or better performance than standard fine-tuning with negligible parameter overhead. Notably, on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet), MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks by large margins. Code will be released.
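Abstract (reference translation): State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune a pre-trained model (e.g., CLIP) on specific datasets. This paper presents the first work on performing parameter-efficient VTR from a pre-trained model, in which only a small number of parameters are tuned while the backbone is frozen. It proposes the Multimodal Video Adapter (MV-Adapter), which efficiently transfers the knowledge in pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter adopts bottleneck structures in both the video and text branches and introduces two new components. The first is a Temporal Adaptation Module, employed in the video branch, that injects global and local temporal contexts. Weight calibrations are also learned to adapt to dynamic variations across frames. The second is a Cross-Modal Interaction Module that generates weights for the video/text branches through a shared parameter space, improving alignment between modalities. Thanks to these innovations, MV-Adapter can achieve performance on par with or better than standard fine-tuning with negligible parameter overhead. In particular, on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, ActivityNet), MV-Adapter consistently outperforms various competing methods on V2T/T2V tasks by large margins. Code will be released.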
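The abstract mentions bottleneck structures attached to a frozen backbone. As a rough illustration of that general idea only, the following is a minimal sketch of a generic residual bottleneck adapter in plain Python. It is not the authors' MV-Adapter: the Temporal Adaptation Module and Cross-Modal Interaction Module are not modeled, and the class name `BottleneckAdapter`, the sizes 512 and 64, the ReLU activation, and the zero-initialized up-projection are all assumptions chosen for illustration.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual bottleneck: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection from the feature size to a small bottleneck size.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Assumption: the up-projection starts at zero, so the adapter is an
        # identity mapping before any training and leaves backbone features intact.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]  # ReLU
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_params(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


dim, bottleneck = 512, 64           # illustrative sizes, not from the paper
adapter = BottleneckAdapter(dim, bottleneck)
features = [0.1] * dim              # stand-in for one frozen-backbone feature vector
adapted = adapter(features)

dense_params = dim * dim            # one full dim-by-dim layer, for comparison
print(adapter.num_params(), dense_params)
```

In a setup like this, only the adapter's two small matrices would be trained and stored per task, while the backbone weights are shared across tasks; that is the storage argument the abstract makes, shown here with made-up sizes.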
