Fugu-MT paper translation (abstract): Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

Paper summary: Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

arxiv url: http://arxiv.org/abs/2512.19663v1
Date: Mon, 22 Dec 2025 18:41:45 GMT
Status: translation complete
Last updated in system: 2025-12-23 18:54:32.882117
Title: Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
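Abstract summary: This paper proposes a knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data. The framework achieves near-perfect text-to-image retrieval performance, with a Recall@1 of 99.94%.
Reference score (independently computed attention metric): 7.945705180020063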
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
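The abstract reports Recall@1 for text-to-image retrieval and mentions contrastive losses between modality pairs. The sketch below illustrates both ideas on toy vectors in plain Python. It is not the authors' implementation: the function names, the toy embeddings, the CLIP-style symmetric cross-entropy form of the loss, and the temperature value of 0.07 are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_embs, image_embs):
    """sim[q][c] = similarity of text query q to image candidate c."""
    return [[cosine_similarity(t, i) for i in image_embs] for t in text_embs]


def recall_at_1(sim):
    """Fraction of queries whose top-ranked candidate is the paired item.

    Assumes query q is paired with candidate q (same index).
    """
    hits = 0
    for q, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == q:
            hits += 1
    return hits / len(sim)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style symmetric cross-entropy over a similarity matrix.

    The temperature is an assumed value, not taken from the paper.
    """

    def mean_cross_entropy(matrix):
        total = 0.0
        for k, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[k]  # the correct "class" is index k
        return total / len(matrix)

    transposed = [list(col) for col in zip(*sim)]
    # Average the text-to-image and image-to-text directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))


# Toy embeddings (hypothetical): the third text is closest to the second image,
# so one of the three queries retrieves the wrong image.
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

sim = similarity_matrix(text_embs, image_embs)
toy_recall = recall_at_1(sim)
toy_loss = symmetric_contrastive_loss(sim)
print(f"Recall@1 = {toy_recall:.4f}, contrastive loss = {toy_loss:.4f}")
```

Under this definition, a Recall@1 of 99.94% means that for nearly every text query the paired image is ranked first among all candidates, while a value near 1% means the paired image is almost never ranked first.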
