Fugu-MT paper translation (abstract): Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

arxiv url: http://arxiv.org/abs/2604.08333v1
Date: Thu, 09 Apr 2026 15:07:26 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.978578
Title: Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
Authors: Xun Zhu, Fanbin Mo, Xi Chen, Kaili Zheng, Shaoshuai Yang, Yiming Shi, Jian Gao, Miao Li, Ji Wu
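Abstract summary: Multimodal large language models (MLLMs) have sparked an unprecedented wave of applications in medical imaging analysis. In medical image classification, however, state-of-the-art medical MLLMs consistently underperform traditional deep learning models. This paper conducts extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets.
Reference score (attention score computed by this site): 14.247959730104085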
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
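The abstract describes probing features stage by stage to see where classification signal is lost. The following is a minimal sketch of that general idea, not the authors' method: it fits a nearest-class-mean probe (a simple stand-in for the linear probes commonly used) on synthetic features for three hypothetical pipeline stages. The stage names, signal strengths, and all function names are assumptions made for illustration only.

```python
import random

def make_stage(rng, n, signal, dim=4):
    """Synthetic 2-class features; the class signal lives in dimension 0."""
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        feats.append(x)
        labels.append(y)
    return feats, labels

def class_means(feats, labels):
    """Fit the probe: one mean feature vector per class."""
    means = {}
    for y in set(labels):
        rows = [x for x, lab in zip(feats, labels) if lab == y]
        means[y] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def probe_accuracy(means, feats, labels):
    """Assign each held-out feature vector to the nearest class mean."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    correct = sum(
        min(means, key=lambda y: dist2(x, means[y])) == lab
        for x, lab in zip(feats, labels)
    )
    return correct / len(labels)

def probe_pipeline(stage_signals, n_train=200, n_test=200, seed=0):
    """Fit and score one probe per stage on freshly sampled features."""
    rng = random.Random(seed)
    results = {}
    for name, signal in stage_signals.items():
        train_x, train_y = make_stage(rng, n_train, signal)
        test_x, test_y = make_stage(rng, n_test, signal)
        results[name] = probe_accuracy(class_means(train_x, train_y), test_x, test_y)
    return results

# Hypothetical stages with decreasing class signal (assumed values, not data).
STAGES = {"vision_encoder": 2.0, "connector": 0.5, "llm_last_layer": 0.0}
accuracies = probe_pipeline(STAGES)
for name, acc in accuracies.items():
    print(f"{name}: probe accuracy = {acc:.2f}")
```

In a real study the synthetic features would be replaced by activations captured at each module or layer, and a drop in probe accuracy between adjacent stages would suggest where the signal degrades.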

Related papers

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images [16.362951636873248]
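Generalist multimodal large language models (MLLMs) have achieved impressive performance on a variety of vision-language tasks. However, their performance on medical tasks remains suboptimal, particularly in zero-shot settings where generalization is critical. This paper presents a pioneering study of the visual grounding capabilities of state-of-the-art medical MLLMs.
Paper reference translation (metadata) (2026-03-15T10:46:27Z)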
Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative [14.002322217782364]
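Multimodal large language models (MLLMs) show promising performance in medical visual question answering (VQA) and report generation. This study examines MLLM architectures for knee osteoarthritis (OA) classification.
Paper reference translation (metadata) (2026-01-05T13:31:44Z)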
EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow [43.82288530883818]
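EH-Benchmark is a new ophthalmology benchmark designed to evaluate hallucinations in medical large language models. It divides hallucinations into two main classes, visual understanding and logical composition, based on specific tasks and error types. The proposed framework significantly mitigates both types of hallucination and improves accuracy, interpretability, and reliability.
Paper reference translation (metadata) (2025-07-24T12:07:36Z)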
MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
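Multimodal hallucinations are multi-sourced, arising from diverse causes. Existing benchmarks fail to adequately distinguish perception-induced hallucinations from reasoning-induced hallucinations.
Paper reference translation (metadata) (2025-05-30T05:54:36Z)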
MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
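Recent advances in multimodal large language models (MLLMs) have shown promising results in integrating diverse modalities such as text and images. MLLMs are strongly affected by modality bias, often relying on language while undervaluing other modalities such as visual input. This paper argues that MLLMs are deeply affected by modality bias and reveals how it manifests across various tasks.
Paper reference translation (metadata) (2025-05-24T11:49:31Z)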
Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping [13.726496817874152]
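Timely identification of intracranial hemorrhage (ICH) on non-contrast CT is important for prognosis prediction and therapeutic decision-making. This study evaluates the performance of zero-shot multimodal large language models (MLLMs) against traditional deep learning methods in ICH binary classification and subtyping.
Paper reference translation (metadata) (2025-05-14T09:54:46Z)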
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? [59.81732629438753]
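LLaVA-RadZ is a simple yet effective framework for zero-shot medical disease recognition that leverages existing MLLM capabilities. Specifically, it designs an end-to-end training strategy called Decoding-Side Feature Alignment Training (DFAT) to exploit the characteristics of the MLLM decoder architecture. It also introduces a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
Paper reference translation (metadata) (2025-03-10T16:05:40Z)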
Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
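To address hallucination in medical information extraction tasks, this work introduces ALternate Contrastive Decoding (ALCD). ALCD shows significant improvements in resolving hallucinations compared with conventional decoding methods.
Paper reference translation (metadata) (2024-10-21T07:19:19Z)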
Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
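Image-gRounded guIdaNcE (MARINE) is a training-free and API-free framework. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. The framework's flexibility further allows the integration of multiple vision models, enabling more reliable and robust object-level guidance.
Paper reference translation (metadata) (2024-02-13T18:59:05Z)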

The related papers list is generated automatically from the titles and abstracts of papers on this site.

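This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from the use of this site (including all information and translations).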