Fugu-MT Paper Translation (Summary): UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis

arXiv URL: http://arxiv.org/abs/2510.15710v2
Date: Mon, 27 Oct 2025 19:55:52 GMT
Status: Translation complete
Last updated (in system): 2025-10-29 15:35:36.21368
Title: UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
Authors: Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
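Abstract summary: We introduce UniMedVL, a unified medical multimodal model for the simultaneous analysis of image understanding and generation tasks. UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities.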
Reference score (independently computed attention score): 41.864457631668806
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
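The abstract says only that UniMed-5M is built by reformatting diverse unimodal data into multimodal pairs. The sketch below shows one way such a conversion might look. It takes a single hypothetical segmentation record and turns it into two pairs. The first is an understanding pair, in which an image, its mask, and a question map to a text answer. The second is a generation pair, in which an image and an instruction map to a mask image. The record fields (`SegmentationSample`), the task tags, and the prompt wording are all assumptions made for illustration. They are not the UniMed-5M schema or the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentationSample:
    # Hypothetical unimodal record: an image, its mask, and the annotated structure.
    image_path: str
    mask_path: str
    structure: str
    modality: str


@dataclass
class MultimodalPair:
    # Hypothetical task tag: "understanding" (text target) or "generation" (image target).
    task: str
    inputs: List[str]  # ordered inputs: image paths and/or text prompts
    target: str        # a text answer or an output image path


def reformat(sample: SegmentationSample) -> List[MultimodalPair]:
    """Convert one unimodal segmentation record into two multimodal pairs (illustrative only)."""
    understanding = MultimodalPair(
        task="understanding",
        inputs=[
            sample.image_path,
            sample.mask_path,
            f"Which anatomical structure does the mask outline in this {sample.modality} image?",
        ],
        target=sample.structure,  # the answer is text
    )
    generation = MultimodalPair(
        task="generation",
        inputs=[sample.image_path, f"Generate the segmentation mask of the {sample.structure}."],
        target=sample.mask_path,  # the output is an image
    )
    return [understanding, generation]


if __name__ == "__main__":
    record = SegmentationSample("ct_001.png", "ct_001_mask.png", "liver", "CT")
    for pair in reformat(record):
        print(pair.task, pair.target)
```

Running the script prints `understanding liver` and then `generation ct_001_mask.png`. Deriving both an understanding pair and a generation pair from the same record is one possible design. It lets a single source sample supervise both directions. This loosely mirrors the bidirectional sharing that the abstract describes, but it is not a claim about how UniMedVL is trained.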