Fugu-MT Paper Translation (Abstract): Aligning Large Multi-Modal Model with Robust Instruction Tuning

Paper summary: Aligning Large Multi-Modal Model with Robust Instruction Tuning

arxiv url: http://arxiv.org/abs/2306.14565v1
Date: Mon, 26 Jun 2023 10:26:33 GMT
Status: Translation complete
Last updated in system: 2023-06-27 14:05:46.032850
Title: Aligning Large Multi-Modal Model with Robust Instruction Tuning
Title (reference translation): Aligning Large Multi-Modal Models via Robust Instruction Tuning
Authors: Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang
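Abstract summary: This paper introduces Large-scale Robust Visual (LRV)-Instruction, a large and diverse visual instruction tuning dataset. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. To effectively measure the hallucination produced by LMMs, we propose GAVIE (GPT4-Assisted Visual Instruction Evaluation).
Reference score (independently computed attention measure): 70.00006772808264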
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without the need for human-annotated ground-truth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly with Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets using less training data compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Our project link is available at https://fuxiaoliu.github.io/LRV/.
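Abstract (reference translation): Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) tend to hallucinate descriptions that are inconsistent with the associated image and human instructions. This paper addresses the problem by introducing a large and diverse visual instruction tuning dataset named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies, which focus mainly on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To effectively measure the hallucination produced by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without human annotation. We conduct comprehensive experiments to investigate the hallucination of LMMs. The results show that existing LMMs exhibit significant hallucination under negative instructions, particularly Existent Element Manipulation instructions. Furthermore, by fine-tuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets with less training data than prior methods. We also observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. The project link is available at https://fuxiaoliu.github.io/LRV/.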
