Fugu-MT Paper Translation (Summary): InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Paper summary: InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

arxiv url: http://arxiv.org/abs/2305.06500v2
Date: Thu, 15 Jun 2023 08:00:18 GMT
Status: Translation complete
System update date: 2023-06-17 00:59:02.140647
Title: InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Title (reference translation): InstructBLIP: Toward general-purpose vision-language models using instruction tuning
Authors: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
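Abstract summary: We study vision-language instruction tuning based on the pretrained BLIP-2 models. InstructBLIP achieves state-of-the-art zero-shot performance across all 13 held-out datasets. Our models also achieve state-of-the-art performance when finetuned on individual downstream tasks.
Reference score (attention score computed independently by this site): 43.54069813039309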
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
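Abstract (reference translation): Large-scale pretraining and instruction tuning have succeeded in producing general-purpose language models with broad capabilities. However, building general-purpose vision-language models is difficult because of the rich input distributions and task diversity introduced by the additional visual input. Vision-language pretraining has been widely studied, but vision-language instruction tuning remains under-explored. This paper presents a systematic and comprehensive study of vision-language instruction tuning based on the pretrained BLIP-2 models. We collect 26 publicly available datasets covering a wide variety of tasks and capabilities and convert them into instruction tuning format. We also introduce an instruction-aware Query Transformer that extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP achieves state-of-the-art zero-shot performance on all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models. Our models also reach state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.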
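The abstract describes converting existing datasets into instruction tuning format and splitting them into held-in and held-out groups. A minimal sketch of what that could look like follows. It is an illustration only: the templates, the field names (`image`, `question`, `answer`), and the helper functions are assumptions made here, not taken from the paper or from the LAVIS code.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical instruction templates for a VQA-style task.
# The templates actually used by the authors are not given in the abstract.
VQA_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question: {question}",
]


@dataclass
class InstructionExample:
    image_path: str   # path to the image the instruction refers to
    instruction: str  # natural-language instruction given to the model
    target: str       # text the model is trained to produce


def to_instruction_format(
    sample: dict, templates: list[str], rng: random.Random
) -> InstructionExample:
    """Wrap a raw VQA-style sample in a randomly chosen instruction template."""
    template = rng.choice(templates)
    return InstructionExample(
        image_path=sample["image"],
        instruction=template.format(question=sample["question"]),
        target=sample["answer"],
    )


def split_held_in_out(
    all_datasets: list[str], held_in: list[str]
) -> tuple[list[str], list[str]]:
    """Partition dataset names into held-in (training) and held-out (zero-shot evaluation)."""
    held_in_set = set(held_in)
    unknown = held_in_set - set(all_datasets)
    if unknown:
        raise ValueError(f"held-in names not in dataset list: {sorted(unknown)}")
    train = [name for name in all_datasets if name in held_in_set]
    evaluate = [name for name in all_datasets if name not in held_in_set]
    return train, evaluate
```

Sampling a template per example, rather than fixing one, is a design choice in this sketch so that the same question can appear under varied phrasings. Passing an explicit `random.Random` instance keeps the conversion reproducible.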


Related papers