Fugu-MT Paper Translation (Summary): Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

Paper summary: Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

arxiv url: http://arxiv.org/abs/2310.05010v1
Date: Sun, 8 Oct 2023 04:46:43 GMT
Status: Translation complete
System update date: 2023-10-12 13:36:05.405064
Title: Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Title (reference translation): Building an open-vocabulary video CLIP model with improved architectures, optimization, and data
Authors: Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang
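Abstract summary: This paper proposes Open-VCLIP++, a framework that adapts CLIP into a strong zero-shot video classifier. The authors show that training Open-VCLIP++ is tantamount to continual learning with zero historical data. The approach is evaluated on three widely used action recognition datasets.
Reference score (attention score computed independently by this site): 102.0069667710562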
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance, while utilizing substantially less fine-tuning data compared to other methods. Code is released at https://github.com/wengzejia1/Open-VCLIP.
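Abstract (reference translation): Although Contrastive Language-Image Pretraining (CLIP) has achieved significant results in zero-shot image recognition, little effort has gone into exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier able to identify novel actions and events at test time. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are then aligned with video features, which helps transfer CLIP to the video domain. The approach is evaluated on three widely used action recognition datasets under a variety of zero-shot evaluation protocols. The results show that the method surpasses existing state-of-the-art techniques by significant margins. Specifically, it reaches zero-shot accuracy of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate the approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance while using substantially less fine-tuning data than other methods. Code is released at https://github.com/wengzejia1/Open-VCLIP.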
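
The abstract names weight interpolation as the core of Interpolated Weight Optimization but gives no formula. As a rough illustration only, the sketch below shows generic linear interpolation between two weight sets, which is one common reading of "weight interpolation"; it is not taken from the paper or its repository. The function name `interpolate_weights`, the mixing coefficient `lam`, and the toy parameter names and values are all assumptions made for this example.

```python
from typing import Dict, List

# A weight set is modeled here as parameter name -> flat list of floats.
# Real models would hold tensors; this layout is an assumption for illustration.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * pretrained + lam * finetuned, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both weight sets must share the same parameter names")

    merged: Weights = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [(1.0 - lam) * b + lam * t for b, t in zip(base, tuned)]
    return merged


# Toy values standing in for original CLIP weights and video-fine-tuned weights.
clip_weights = {"proj": [0.0, 2.0], "bias": [1.0]}
video_weights = {"proj": [4.0, 2.0], "bias": [3.0]}

merged = interpolate_weights(clip_weights, video_weights, lam=0.5)
print(merged)
```

With `lam=0.0` the result equals the pretrained weights and with `lam=1.0` it equals the fine-tuned ones; intermediate values trade off the two. How the paper chooses or schedules such a coefficient during training and testing is not stated in the abstract.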

Related papers