Fugu-MT Paper Translation (Abstract): InfMLLM: A Unified Framework for Visual-Language Tasks

Paper overview: InfMLLM: A Unified Framework for Visual-Language Tasks

arxiv url: http://arxiv.org/abs/2311.06791v2
Date: Wed, 6 Dec 2023 11:06:06 GMT
Status: Translation complete
Last updated in system: 2023-12-07 18:04:09.920793
Title: InfMLLM: A Unified Framework for Visual-Language Tasks
Title (reference translation): InfMLLM: A Unified Framework for Visual-Language Tasks
Authors: Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi
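Abstract summary: Multimodal large language models (MLLMs) have attracted growing interest. This work aims to enable LLMs to tackle more vision-language-related tasks. InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
Reference score (attention level, computed independently by this site): 44.29407348046122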
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction-following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and have evaluated it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at: https://github.com/mightyzau/InfMLLM.
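Abstract (reference translation): Large language models (LLMs) have proven remarkably versatile across a wide range of language-centric applications. To extend LLMs to a broader range of input modalities, multimodal large language models (MLLMs) have attracted growing interest. This work focuses on enabling LLMs to tackle more vision-language tasks, in particular image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally LLM fine-tuning to improve instruction-following capability. GPU memory requirements increase gradually over the course of training. To manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a simple visual adapter module called the pool-adapter. Experiments show that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks such as visual grounding. We name the proposed approach InfMLLM and evaluate it extensively on a variety of benchmark datasets. The results show that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be open-sourced at https://github.com/mightyzau/InfMLLM.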
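
The abstract does not describe the internals of the pool-adapter, so the following is only a rough sketch of the general idea it names: reducing the number of visual embeddings while keeping their spatial layout. It assumes plain average pooling over a square, row-major grid of patch embeddings; the function name `pool_grid` and its parameters are hypothetical, and any projection into the LLM's embedding space is omitted. A real model would use a tensor library rather than nested lists.

```python
def pool_grid(embeddings, grid, out):
    """Average-pool grid*grid row-major patch embeddings down to out*out tokens.

    Hypothetical illustration only, not the InfMLLM implementation.
    Each output token is the mean of one k x k window, and the output stays
    in row-major order, so relative positions are preserved.
    """
    if len(embeddings) != grid * grid:
        raise ValueError("expected grid*grid embeddings")
    if grid % out != 0:
        raise ValueError("grid must be divisible by out")
    k = grid // out                      # window size along each axis
    dim = len(embeddings[0])
    pooled = []
    for r in range(out):
        for c in range(out):
            acc = [0.0] * dim
            for dr in range(k):
                for dc in range(k):
                    vec = embeddings[(r * k + dr) * grid + (c * k + dc)]
                    for i, v in enumerate(vec):
                        acc[i] += v
            pooled.append([a / (k * k) for a in acc])
    return pooled


# Toy input: a 4x4 grid where each "embedding" is just its (row, col) position.
patches = [[float(r), float(c)] for r in range(4) for c in range(4)]
tokens = pool_grid(patches, grid=4, out=2)   # 16 embeddings -> 4 tokens
print(tokens)
```

In this toy run each pooled token equals the centre coordinate of its window, which shows that the reduced sequence still encodes where in the image each token came from. That is the property the abstract reports as helpful for visual grounding.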

Related papers