Fugu-MT Paper Translation (Abstract): The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

Paper overview: The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

arxiv url: http://arxiv.org/abs/2512.08374v1
Date: Tue, 09 Dec 2025 08:57:11 GMT
Status: Translation complete
Last updated in system: 2025-12-10 22:28:07.88636
Title: The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
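Title (reference translation): The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
Authors: Bozhou Li, Xinda Xue, Sihan Yang, Yang Shi, Xinlong Chen, Yushuo Guan, Yuanxing Zhang, Wentao Zhang
Abstract summary: Multimodal Large Language Models (MLLMs) couple pre-trained vision encoders with language models. Their reliance on the ubiquitous Pre-Norm architecture introduces a severe norm disparity between high-norm visual tokens and low-norm text tokens. The authors propose a simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment.
Reference score (attention score computed independently by the site): 15.598471176315913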
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
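Abstract (reference translation): Multimodal Large Language Models (MLLMs), which combine pre-trained vision encoders with language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between high-norm visual tokens and low-norm text tokens. This work presents a formal theoretical analysis showing that the imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic" in which high-norm visual tokens exhibit "representational inertia" and transform semantically much more slowly than their textual counterparts, which impairs effective cross-modal feature fusion. Empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of the norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, the authors propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.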
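
Illustrative sketch (not from the paper): the abstract's two central ideas can be mimicked with a toy calculation. In a Pre-Norm block the residual update is computed from a normalized input, so its size does not grow with the token's norm; adding the same update to a high-norm token therefore rotates it less than a low-norm token. The code below checks this and then rescales the high-norm token with a LayerNorm whose gain is chosen to match the text-token norm. Everything specific here is an assumption made for illustration: the dimension, the norms 1 and 50, the update size 0.5, the scalar gain, and the initialization rule `gain = target_norm / sqrt(d)`. The abstract does not state how the authors initialize their LayerNorm.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def scale_to(v, norm):
    """Rescale v so that its Euclidean norm equals `norm`."""
    current = l2_norm(v)
    return [x * norm / current for x in v]


def angle_deg(a, b):
    """Angle between two vectors, in degrees."""
    cos = sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def layer_norm(v, gain, bias=0.0, eps=1e-5):
    """LayerNorm over one token. A scalar gain/bias is a simplification;
    real LayerNorm layers learn one gain and bias per channel."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gain * (x - mean) * inv_std + bias for x in v]


def residual_step(x, update):
    """Residual connection x + update. `update` stands in for the sublayer
    output F(LN(x)), whose magnitude does not scale with the norm of x."""
    return [xi + ui for xi, ui in zip(x, update)]


rng = random.Random(0)
d = 64  # hypothetical hidden size

# Same direction, different norms (values are made up for illustration).
direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
text = scale_to(direction, 1.0)     # low-norm "text" token
visual = scale_to(direction, 50.0)  # high-norm "visual" token
update = scale_to([rng.gauss(0.0, 1.0) for _ in range(d)], 0.5)

# How far one residual update rotates each token.
text_angle = angle_deg(text, residual_step(text, update))
visual_angle = angle_deg(visual, residual_step(visual, update))

# Assumed initialization: with zero bias, LayerNorm output has norm about
# gain * sqrt(d), so this gain maps the visual token onto the text norm.
gain = l2_norm(text) / math.sqrt(d)
aligned = layer_norm(visual, gain)
aligned_angle = angle_deg(aligned, residual_step(aligned, update))

print(f"rotation of text token:           {text_angle:.2f} deg")
print(f"rotation of visual token:         {visual_angle:.2f} deg")
print(f"rotation of aligned visual token: {aligned_angle:.2f} deg")
print(f"aligned visual norm:              {l2_norm(aligned):.4f}")
```

In this toy setting the unaligned visual token barely rotates, while the norm-aligned one rotates by an amount of the same order as the text token. This is only a geometric illustration of the described "representational inertia", not a reproduction of the paper's analysis or experiments.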

Related papers