Fugu-MT Paper Translation (Abstract): Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

Paper overview: Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

arxiv url: http://arxiv.org/abs/2511.00916v1
Date: Sun, 02 Nov 2025 12:30:22 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.993117
Title: Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
Title (reference translation): Fleming-VL: Toward Universal Medical Visual Reasoning Using Multimodal LLMs
Authors: Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai
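Abstract summary: Fleming-VL is a framework for comprehensive medical visual understanding across heterogeneous modalities. It achieves state-of-the-art performance on multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding.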
Reference score (independently computed attention score): 7.542510160217106
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature -- encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.
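Abstract (reference translation): Multimodal Large Language Models (MLLMs) have proven remarkably effective in a range of general-domain scenarios, such as visual question answering and image captioning. Recent work has increasingly focused on giving MLLMs medical conversational abilities, which hold significant promise for clinical applications. Medical data, however, poses unique challenges because it is heterogeneous: it spans diverse modalities, including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and the data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL approaches the problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; and (3) extending existing evaluation frameworks to include 3D volumetric and video understanding benchmarks. Using supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL at multiple model scales. Fleming-VL achieves state-of-the-art performance on multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.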

Related papers