Fugu-MT Paper Translation (Abstract): Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Paper overview: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2507.12566v1
Date: Wed, 16 Jul 2025 18:31:23 GMT
Status: Translation complete
Last updated in system: 2025-07-18 20:10:24.24275
Title: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Title (reference translation): Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
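Abstract summary: This paper addresses monolithic Multimodal Large Language Models (MLLMs). Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, the authors embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning.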
Reference score (independently computed attention metric): 70.59376970630387
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
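Abstract (reference translation): This paper addresses monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, we embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. We also design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities through progressive learning. Mono-InternVL achieves performance competitive with existing MLLMs, but its data cost is relatively high. We therefore present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and reorganizes the pre-training process efficiently. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining performance competitive with Mono-InternVL. To evaluate the approach, we conduct extensive experiments across 15 benchmarks. The results show that Mono-InternVL outperforms existing monolithic MLLMs on 12 of the 15 benchmarks, e.g., a +114-point improvement over Emu3 on OCRBench. Compared with its modular counterpart, InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.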
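
Illustrative sketch (not the authors' implementation): the abstract describes adding visual experts to a pre-trained LLM and training the new visual parameters via delta tuning. The toy Python below shows one way to read that idea, under stated assumptions: each token is routed statically by modality, the pre-trained text expert stays frozen, and only the newly added visual expert is updated. The names `Expert`, `route_by_modality`, and `delta_tuning_step`, and the elementwise "expert" itself, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Expert:
    """Toy stand-in for a feed-forward expert: y = scale * x + bias (elementwise)."""
    scale: float
    bias: float
    trainable: bool

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v + self.bias for v in x]


def route_by_modality(tokens: List[Vector], is_visual: List[bool],
                      text_expert: Expert, visual_expert: Expert) -> List[Vector]:
    """Assumed routing rule: visual tokens use the visual expert, text tokens the text expert."""
    return [
        visual_expert(tok) if vis else text_expert(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


def delta_tuning_step(experts: List[Expert],
                      grads: List[Tuple[float, float]], lr: float = 0.1) -> None:
    """Apply a gradient step only to trainable experts; frozen experts are left untouched."""
    for expert, (g_scale, g_bias) in zip(experts, grads):
        if expert.trainable:
            expert.scale -= lr * g_scale
            expert.bias -= lr * g_bias


# Hypothetical values: the frozen expert stands in for the pre-trained LLM weights,
# the trainable expert for the newly added visual parameter space.
text_expert = Expert(scale=2.0, bias=0.0, trainable=False)
visual_expert = Expert(scale=1.0, bias=0.5, trainable=True)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, False, True]
outputs = route_by_modality(tokens, is_visual, text_expert, visual_expert)

# Made-up gradients; only the visual expert changes.
delta_tuning_step([text_expert, visual_expert], grads=[(1.0, 1.0), (1.0, 1.0)])
```

Because the text expert is never updated in this sketch, its behavior on text tokens is identical before and after training, which is the intuition behind using frozen pre-trained weights to avoid catastrophic forgetting.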

Related papers