Fugu-MT paper translation (abstract): CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Paper abstract: CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

arxiv url: http://arxiv.org/abs/2511.11034v1
Date: Fri, 14 Nov 2025 07:41:01 GMT
Status: translation complete
Last updated in system: 2025-11-17 22:42:18.481747
Title: CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
Title (reference translation): CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
Authors: Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar
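Abstract summary: This paper introduces CrossMed, a benchmark for evaluating compositional generalization (CG) in medical vision-language models. It reformulates four public datasets into a unified visual question answering (VQA) format, yielding 20,200 multiple-choice QA instances. Models trained on Related splits achieve 83.2% classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions.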
Reference score (attention score computed independently by this site): 2.9857131541387827
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
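Abstract (reference translation): Recent advances in multimodal large language models enable unified processing of visual and textual inputs and offer promising applications in general-purpose medical AI. However, their ability to generalize compositionally to unseen combinations of imaging modality, anatomy, and task type has not yet been studied in depth. We introduce CrossMed, a benchmark for evaluating compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification), into a unified visual question answering (VQA) format, yielding 20,200 multiple-choice QA instances. Two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, are evaluated on Related and Unrelated MAT splits and in a zero-overlap setting where test triplets share no modality, anatomy, or task with the training data. Models trained on Related splits reach 83.2% classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, which shows the difficulty of the benchmark. The paper also shows cross-task transfer: segmentation performance improves by 7% cIoU even when training uses classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.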
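
The zero-overlap condition described in the abstract (test triplets share no modality, anatomy, or task with the training data) can be illustrated with a minimal Python sketch. This is not the authors' code. The `MAT` type, the helper names, and the anatomy labels ("chest", "brain") are assumptions made for illustration; the abstract only states each dataset's modality and task.

```python
from typing import Iterable, List, NamedTuple


class MAT(NamedTuple):
    """A Modality-Anatomy-Task triplet (hypothetical representation)."""
    modality: str
    anatomy: str
    task: str


def shared_fields(a: MAT, b: MAT) -> int:
    """Count how many of the three fields two triplets have in common."""
    return sum(x == y for x, y in zip(a, b))


def is_zero_overlap(test: MAT, train: Iterable[MAT]) -> bool:
    """True if the test triplet shares no field with any training triplet."""
    return all(shared_fields(test, t) == 0 for t in train)


def zero_overlap_train_set(test: MAT, pool: Iterable[MAT]) -> List[MAT]:
    """Keep only the pool triplets that share nothing with the test triplet."""
    return [t for t in pool if shared_fields(test, t) == 0]


# Triplets implied by the four datasets named in the abstract.
# Anatomy labels are assumed, not taken from the paper.
POOL = [
    MAT("X-ray", "chest", "classification"),  # CheXpert
    MAT("X-ray", "chest", "segmentation"),    # SIIM-ACR
    MAT("MRI", "brain", "classification"),    # BraTS 2020
    MAT("MRI", "brain", "segmentation"),      # BraTS 2020
    MAT("CT", "chest", "classification"),     # MosMedData
]

if __name__ == "__main__":
    target = MAT("MRI", "brain", "segmentation")
    for triplet in zero_overlap_train_set(target, POOL):
        print(triplet)
```

Under these assumed labels, holding out MRI brain segmentation leaves only the X-ray and CT classification triplets as admissible training data, which suggests how restrictive the zero-overlap setting is.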

Related papers