Fugu-MT Paper Translation (Abstract): The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Paper overview: The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

arxiv url: http://arxiv.org/abs/2604.20665v1
Date: Wed, 22 Apr 2026 15:15:32 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:11.1888
Title: The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
Title (reference translation): Attaining Trustworthy Multimodal Reasoning in the Monolithic Paradigm
Authors: Karan Goyal, Dikshant Kukreja,
Abstract summary: We argue that current Vision-Language Models do not faithfully synthesise multimodal data. We propose the Modality Translation Protocol.
Reference score (independently computed attention score): 1.0742675209112622
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.
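Abstract (reference translation): The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation rests on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue that they do not. Instead, a deep crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness. In this work, we challenge the conventional multimodal evaluation methodology, which relies on data ablation or new dataset creation. We propose a radical, information-theoretic departure: the Modality Translation Protocol. By translating semantic payloads rather than ablating them, we formulate three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We call on the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.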


Related papers