Fugu-MT Paper Translation (Abstract): One For All: Video Conversation is Feasible Without Video Instruction Tuning

Paper overview: One For All: Video Conversation is Feasible Without Video Instruction Tuning

arxiv url: http://arxiv.org/abs/2309.15785v1
Date: Wed, 27 Sep 2023 16:58:35 GMT
Status: Translation complete
Last updated in system: 2023-09-28 12:43:25.917808
Title: One For All: Video Conversation is Feasible Without Video Instruction Tuning
Title (reference translation): Video conversation is possible even without video instructions
Authors: Ruyang Liu and Chen Li and Yixiao Ge and Ying Shan and Thomas H. Li and Ge Li
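Abstract summary: BT-Adapter is a novel method for extending image-language pretrained models into the video domain. Once trained, BT-Adapter can be seamlessly integrated into all image conversation models. BT-Adapter achieves state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours.
Reference score (independently calculated attention score): 80.00756768030534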
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, but how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of the LLM and the visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while keeping the backbone frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results of video chatting using video instruction tuning, outperforming previous SOTAs by a large margin.
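Abstract (reference translation): Recent progress in Large Language Models (LLMs) has accelerated advances in image-language conversation agents, while how to build a video-based dialogue system is still being explored. Given the large scale of the LLM and the visual backbone, only minimal GPU memory remains for effective temporal modeling, which is essential for understanding videos and providing feedback on them. We therefore propose Branching Temporal Adapter (BT-Adapter), a new method for extending image-language pretrained models to the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, and it is tuned while the backbone stays frozen. Once pretrained, BT-Adapter can be seamlessly integrated into all image conversation models that use this version of CLIP, enabling video conversation without video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, together with training tasks tailored to BT-Adapter, to obtain faster convergence and better results. Thanks to BT-Adapter, existing multimodal dialogue models can be given strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art video chatting results with video instruction tuning, outperforming previous SOTAs by a large margin.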
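
Note: the abstract describes a trainable temporal branch that runs alongside a frozen image encoder. The following is only a toy sketch of that general idea in plain Python, not the BT-Adapter architecture itself. Every name in it (`frozen_image_encoder`, `TemporalBranch`, `video_embedding`) is hypothetical. The zero initialisation and the additive fusion are assumptions made for illustration and are not taken from the paper.

```python
# Toy illustration only: NOT the architecture from the paper.
# All names and design choices here are hypothetical assumptions.
from typing import List

Vector = List[float]


def frozen_image_encoder(frame: Vector) -> Vector:
    """Stand-in for a pretrained per-frame encoder; it has no trainable state."""
    return [2.0 * x for x in frame]


class TemporalBranch:
    """Trainable side branch that mixes information across frames."""

    def __init__(self, num_frames: int) -> None:
        # One learnable weight per frame. Zero initialisation (an assumption)
        # makes the branch a no-op before any training has happened.
        self.weights = [0.0] * num_frames

    def __call__(self, frame_features: List[Vector]) -> Vector:
        dim = len(frame_features[0])
        return [
            sum(w * feat[d] for w, feat in zip(self.weights, frame_features))
            for d in range(dim)
        ]


def video_embedding(frames: List[Vector], branch: TemporalBranch) -> Vector:
    feats = [frozen_image_encoder(f) for f in frames]
    dim = len(feats[0])
    # Backbone path: plain average of per-frame features (no temporal modeling).
    pooled = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    # Branch path: temporal mixing, added on top of the backbone output.
    temporal = branch(feats)
    return [p + t for p, t in zip(pooled, temporal)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
branch = TemporalBranch(num_frames=len(frames))
baseline = video_embedding(frames, branch)  # branch is still a no-op here
branch.weights[0] = 0.5  # pretend a training step updated only the branch
adapted = video_embedding(frames, branch)
```

In this sketch only `branch.weights` ever changes, so the image encoder could be shared unchanged with any model that already depends on it. That is the property the abstract attributes to tuning the branch while keeping the backbone frozen.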

Related papers