Fugu-MT Paper Translation (Abstract): mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Paper overview: mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

arxiv url: http://arxiv.org/abs/2304.14178v3
Date: Fri, 29 Mar 2024 08:13:38 GMT
Status: Translation complete
System update date: 2024-04-01 20:56:17.086160
Title: mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Title (reference translation): mPLUG-Owl: Modularization That Enables Large Language Models with Multimodality
Authors: Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
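Abstract summary: mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multimodal abilities. The paradigm includes a two-stage method for aligning image and text, which learns visual knowledge with the help of the LLM. Experimental results show that the model outperforms existing multimodal models.
Reference score (independently computed attention score): 95.76661165594884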
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
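Abstract (reference translation): Large language models (LLMs) have shown impressive zero-shot abilities on a variety of open-ended tasks, and recent research has also explored using LLMs for multimodal generation. In this study, we introduce mPLUG-Owl, a new training paradigm that equips LLMs with multimodal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach supports multiple modalities and, through modality collaboration, facilitates diverse unimodal and multimodal abilities. The training paradigm of mPLUG-Owl includes a two-stage method for aligning image and text, which learns visual knowledge with the help of the LLM while maintaining and even improving the LLM's generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multimodal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module while the visual knowledge module is frozen. We carefully build OwlEval, a visually related instruction evaluation set. Experimental results show that our model outperforms existing multimodal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, unexpected and exciting abilities such as multi-image correlation and scene text understanding were observed, making it possible to apply the model to more complex scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.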
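
The two-stage schedule described in the abstract can be summarized as a small freeze/train plan. The following is only an illustrative sketch, not the authors' implementation: the module names mirror the abstract, while the `StagePlan` class, the `training_plan` function, and the stage-1 data label are hypothetical. The abstract does not say what data stage 1 uses, so "image-text pairs" is an assumption. Listing the base LLM weights as frozen in stage 2 reflects how LoRA is commonly applied and is likewise assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Which modules receive gradient updates in one training stage."""
    name: str
    trainable: frozenset
    frozen: frozenset
    data: tuple


def training_plan():
    # Stage 1: align image and text; the LLM stays frozen.
    stage1 = StagePlan(
        name="stage 1: image-text alignment",
        trainable=frozenset({"visual_knowledge", "visual_abstractor"}),
        frozen=frozenset({"llm"}),
        data=("image-text pairs",),  # assumption, not stated in the abstract
    )
    # Stage 2: jointly tune a LoRA module on the LLM plus the abstractor;
    # the visual knowledge module is frozen.
    stage2 = StagePlan(
        name="stage 2: joint instruction tuning",
        trainable=frozenset({"llm_lora", "visual_abstractor"}),
        frozen=frozenset({"visual_knowledge", "llm"}),  # base LLM frozen: assumed
        data=("language-only supervised", "multi-modal supervised"),
    )
    return [stage1, stage2]


if __name__ == "__main__":
    for stage in training_plan():
        print(stage.name)
        print("  trainable:", sorted(stage.trainable))
        print("  frozen:   ", sorted(stage.frozen))
        print("  data:     ", ", ".join(stage.data))
```

In a real training script, a plan like this would typically be turned into per-parameter gradient flags; that step is left out here because it depends on the framework and on the released code.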

Related papers