Fugu-MT Paper Translation (Abstract): Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Paper summary: Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

arxiv url: http://arxiv.org/abs/2606.09131v1
Date: Mon, 08 Jun 2026 07:28:14 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.799753
Title: Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
Title (reference translation): Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
Authors: Siyuan Liu, Jinyang Wu,
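Abstract summary: Multimodal large language models (MLLMs) typically inherit the deep, symmetric Transformer backbone designed for unimodal text modeling. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. We propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs.
Reference score (independently computed prominence): 6.369257323378483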
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
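The layer-wise measurement mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how text-to-image attention is aggregated, so the definition below (attention mass that text query positions place on image key positions, averaged over heads and text queries) is an assumption, and the function name `text_to_image_attention` is hypothetical.

```python
import numpy as np

def text_to_image_attention(attn, image_mask):
    """Mean attention mass that text queries place on image keys.

    attn: array of shape (heads, seq, seq); each row sums to 1 over keys.
    image_mask: boolean array of shape (seq,), True at image positions.
    This aggregation is an assumed definition, not taken from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    text_rows = attn[:, ~image_mask, :]                        # queries at text positions
    mass_on_image = text_rows[:, :, image_mask].sum(axis=-1)   # sum over image keys
    return float(mass_on_image.mean())                         # average over heads and queries

# Toy check: uniform attention over 6 keys, 4 of which are image positions.
uniform = np.full((2, 6, 6), 1.0 / 6.0)
mask = np.array([True] * 4 + [False] * 2)
print(round(text_to_image_attention(uniform, mask), 4))  # 0.6667
```

Applied once per layer to that layer's attention weights, a metric of this kind would give a curve over depth comparable in spirit to the 0.68, 0.07, and 0.04 values quoted above.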
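Abstract (reference translation): Multimodal large language models (MLLMs) typically inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. A layer-wise analysis of LLaVA-1.5 shows that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4 and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These results suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, which leads to redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. We therefore propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.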
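The routing described for DPVR-LF can be sketched as a toy forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the 18 + 13 + 1 layer split is one reading of the numbers in the abstract, `make_layer` is a hypothetical single-head attention block without causal masking or an MLP, and the names `dpvr_lf_forward`, `shallow`, `side_branch`, `deep_text`, and `final` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim):
    """Toy stand-in for a Transformer block: single-head self-attention plus residual."""
    w = rng.normal(scale=dim ** -0.5, size=(3, dim, dim))

    def layer(x):
        q, k, v = x @ w[0], x @ w[1], x @ w[2]
        scores = q @ k.T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        return x + probs @ v

    return layer

def dpvr_lf_forward(x, image_mask, shallow, side_branch, deep_text, final):
    """Assumed DPVR-LF style routing over a (seq, dim) token matrix."""
    # 1) Joint image + text processing up to the assumed saturation point.
    for layer in shallow:
        x = layer(x)
    # 2) Vision tokens leave the main stack and pass through a one-layer side branch.
    vision = side_branch(x[image_mask])
    # 3) Deep stack runs on text positions only; image positions are skipped.
    text = x[~image_mask]
    for layer in deep_text:
        text = layer(text)
    # 4) Both streams are written back to their original positions and fused
    #    by the final layer.
    fused = np.empty_like(x)
    fused[image_mask] = vision
    fused[~image_mask] = text
    return final(fused)

dim, n_img, n_txt = 16, 8, 4
image_mask = np.array([True] * n_img + [False] * n_txt)
x = rng.normal(size=(n_img + n_txt, dim))
shallow = [make_layer(dim) for _ in range(18)]     # assumed: layers before saturation
deep_text = [make_layer(dim) for _ in range(13)]   # thirteen-layer text-only forward
side_branch, final = make_layer(dim), make_layer(dim)
out = dpvr_lf_forward(x, image_mask, shallow, side_branch, deep_text, final)
print(out.shape)
```

In this toy setup the thirteen deep layers attend over 4 text tokens instead of all 12, which is the source of the reduced visual computation the abstract refers to; which weights are frozen or trained is not modeled here.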

Related papers