Fugu-MT Paper Translation (Abstract): Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Paper summary: Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2512.06281v1
Date: Sat, 06 Dec 2025 04:20:13 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.282245
Title: Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Title (reference translation): Unlocking the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Authors: Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin
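Abstract summary: We propose LaVer, a novel training framework that helps MLLMs learn more discriminative visual representations. The method provides direct visual activation to MLLMs, which show increased visual attention allocation, indicating better use of visual information.
Reference score (independently computed attention score): 58.91911788912665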
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code of LaVer is available at https://github.com/Fir-lat/LaVer.
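Abstract (reference translation): Multimodal Large Language Models (MLLMs) have shown remarkable proficiency in multimodal tasks. Despite this impressive performance, MLLMs suffer from a modality imbalance problem: in deeper layers, visual information is often underutilized relative to textual representations, which leads to degraded visual performance or hallucinations. The problem stems from heavy reliance on next-text-token prediction during training, which provides no direct visual supervisory signal and causes visual representations to homogenize progressively across layers. To address this, we propose LaVer, a novel training framework that helps MLLMs learn more discriminative visual representations through masked image modeling in the joint latent semantic space of the LLM. The method provides direct visual activation to MLLMs, which show increased visual attention allocation, indicating better use of visual information. Extensive experiments across diverse benchmarks demonstrate the advantage of the approach in a range of scenarios, especially those requiring dense visual capabilities. The LaVer code is available at https://github.com/Fir-lat/LaVer.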
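
Illustrative sketch (not from the paper): the abstract names two ideas without giving formulas, masked image modeling over latent visual tokens and the share of attention allocated to visual tokens. The Python below is a generic, minimal illustration of both, assuming a mean-squared-error reconstruction objective and a simple attention-mass ratio. The function names, the MSE choice, and the mask ratio are all assumptions made for illustration; they are not LaVer's actual objective or code, which is in the linked repository.

```python
import random


def sample_mask(num_tokens, mask_ratio, rng):
    """Choose which visual-token positions to hide (True = masked)."""
    num_masked = max(1, round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_reconstruction_loss(predicted, target, mask):
    """Mean squared error computed over masked positions only.

    Assumption: MSE stands in for whatever reconstruction loss is
    actually used; the abstract does not specify one.
    """
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue  # unmasked tokens carry no reconstruction signal
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        count += len(tgt_vec)
    return total / count if count else 0.0


def visual_attention_share(attn_row, is_visual):
    """Fraction of one query's attention mass that lands on visual tokens."""
    visual_mass = sum(a for a, v in zip(attn_row, is_visual) if v)
    return visual_mass / sum(attn_row)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical latent targets: 8 visual tokens, 4 dimensions each.
    target = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    mask = sample_mask(len(target), mask_ratio=0.5, rng=rng)
    # A model that predicts all zeros, scored on the masked tokens only.
    predicted = [[0.0] * 4 for _ in target]
    print("masked MSE:", masked_reconstruction_loss(predicted, target, mask))
    # Two visual tokens followed by two text tokens.
    print("visual share:",
          visual_attention_share([0.1, 0.2, 0.3, 0.4],
                                 [True, True, False, False]))
```

In this sketch the loss ignores unmasked positions, so the training signal comes only from tokens the model could not see. A higher visual attention share would correspond to the "increased visual attention allocation" the abstract reports, although how the paper measures it is not stated here.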
