Fugu-MT Paper Translation (Summary): Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Paper summary: Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2603.20808v1
Date: Sat, 21 Mar 2026 13:10:37 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.095783
Title: Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Authors: Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng
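Abstract summary: We conduct a detailed diagnostic analysis to reveal a pervasive issue: visual representation degradation in MLLMs. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, in which the model compromises its visual fidelity to optimize for answer generation. We propose Predictive Regularization (PRe), which forces degraded intermediate features to predict the initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations.
Reference score (independently computed attention metric): 84.94288033791346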
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
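The abstract describes PRe only at a high level: intermediate visual features are made to predict the initial visual features, alongside the usual text-generation objective. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The abstract does not specify the loss form, so the cosine distance, the linear predictor head `W`, the weight `lam`, and all function names here are hypothetical choices made for illustration.

```python
import math


def linear(x, W):
    """Apply a hypothetical linear predictor head: x (d_in) times W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors; eps guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def predictive_regularization(intermediate, initial, W):
    """Mean cosine distance between predicted and initial visual features.

    intermediate: per-token visual features from a middle LLM layer (assumed input)
    initial:      per-token features from the vision encoder, treated as fixed targets
    W:            predictor head mapping intermediate features to the target space
    """
    losses = [1.0 - cosine(linear(h, W), v) for h, v in zip(intermediate, initial)]
    return sum(losses) / len(losses)


def total_loss(text_loss, pre_loss, lam=0.1):
    """Combine the text-generation loss with the regularizer (lam is an assumed weight)."""
    return text_loss + lam * pre_loss


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    initial_feats = [[1.0, 2.0], [3.0, -1.0]]
    degraded_feats = [[1.0, 0.5], [0.5, -1.0]]  # toy stand-in for middle-layer features
    reg = predictive_regularization(degraded_feats, initial_feats, identity)
    print(total_loss(text_loss=2.0, pre_loss=reg))
```

In this sketch the regularizer is near zero when the predicted features point in the same direction as the initial ones and grows as they diverge, so minimizing the combined loss discourages the intermediate representation from drifting away from the encoder's output.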

Related papers