Fugu-MT Paper Translation (Abstract): BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

Paper summary: BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

arxiv url: http://arxiv.org/abs/2309.15785v2
Date: Thu, 27 Jun 2024 12:05:48 GMT
Status: Translation complete
System update date: 2024-06-28 20:16:23.401363
Title: BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
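Title (reference translation): BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Authors: Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
Abstract summary: BT-Adapter is a novel method for extending image-language pretrained models into the video domain. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models. BT-Adapter achieves state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours.
Reference score (independently computed attention score): 75.50620335266682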
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Given the extensive scale of the LLM and the visual backbone, minimal GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose the Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, and it is tuned while the backbone is kept frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models that use this version of CLIP, enabling video conversations without the need for video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results in video chatting when using video instruction tuning, outperforming previous SOTAs by a large margin.
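Abstract (reference translation): Recent advances in Large Language Models (LLMs) have accelerated progress in image-language dialogue agents, while how to build a capable video-based dialogue system is still being explored. Given the large scale of the LLM and the visual backbone, only minimal GPU memory remains for the effective temporal modeling that is essential for understanding videos and giving feedback on them. This work therefore proposes the Branching Temporal Adapter (BT-Adapter), a new method for extending image-language pretrained models to the video domain. Specifically, BT-Adapter acts as a plug-and-use temporal modeling branch alongside the pretrained visual encoder and is tuned while the backbone stays frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models that use this version of CLIP, enabling video conversation without video instructions. In addition, a unique asymmetric token masking strategy inside the branch is developed, together with training tasks tailored to BT-Adapter, giving faster convergence and better results. Thanks to BT-Adapter, existing multimodal dialogue models can be given strong video understanding capabilities without excessive GPU cost. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art video chatting results when video instruction tuning is used, outperforming previous SOTAs by a large margin.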
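
Illustrative sketch (not from the paper): the abstract describes a trainable temporal branch placed alongside a frozen pretrained visual encoder. The toy Python below shows only that general pattern. Every name in it (`frozen_frame_encoder`, `TemporalBranch`, `encode_video`, `train_branch_step`) is hypothetical, the "encoder" is a fixed linear map rather than CLIP, and the branch is a single learned mixing weight per frame rather than the authors' architecture. The asymmetric token masking strategy is not sketched, because the abstract gives no details of it.

```python
def frozen_frame_encoder(frame, weights):
    """Stand-in for a pretrained image encoder: a fixed linear map per frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


class TemporalBranch:
    """Hypothetical trainable branch: one learned mixing weight per frame."""

    def __init__(self, num_frames):
        # Zero init (an assumption): the branch starts as a no-op, so the
        # combined model initially matches the frozen image-only path.
        self.mix = [0.0] * num_frames

    def forward(self, frame_feats):
        dim = len(frame_feats[0])
        return [sum(m * f[d] for m, f in zip(self.mix, frame_feats))
                for d in range(dim)]


def encode_video(frames, weights, branch):
    """Frozen path (mean over frames) plus the branch's temporal correction."""
    feats = [frozen_frame_encoder(f, weights) for f in frames]
    dim = len(feats[0])
    pooled = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    delta = branch.forward(feats)
    return [p + d for p, d in zip(pooled, delta)]


def squared_error(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target))


def train_branch_step(frames, target, weights, branch, lr=0.1):
    """One gradient step on the branch only; `weights` is never modified."""
    feats = [frozen_frame_encoder(f, weights) for f in frames]
    output = encode_video(frames, weights, branch)
    err = [o - t for o, t in zip(output, target)]
    for t, feat in enumerate(feats):
        # d(loss)/d(mix[t]) for the squared-error loss above.
        grad = 2.0 * sum(e * x for e, x in zip(err, feat))
        branch.mix[t] -= lr * grad
    return squared_error(output, target)  # loss measured before the update
```

Because only `branch.mix` is updated, the same frozen encoder weights can be shared with any model built on them, which is the property the abstract relies on when it says the adapter is pretrained once and then plugged into existing image conversation models.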

Related papers