Fugu-MT paper translation (abstract): EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

Paper overview: EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

arxiv url: http://arxiv.org/abs/2409.06644v2
Date: Wed, 11 Sep 2024 17:00:09 GMT
Status: Translation complete
Last updated in system: 2024-09-12 13:13:20.757296
Title: EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
Title (reference translation): EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
Authors: Danli Shi, Weiyi Zhang, Jiancheng Yang, Siyu Huang, Xiaolan Chen, Mayinuer Yusufu, Kai Jin, Shan Lin, Shunming Liu, Qing Zhang, Mingguang He
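Abstract summary: This study proposes EyeCLIP, a visual-language foundation model developed using over 2.77 million ophthalmic images with partial text data. EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases.
Reference score (attention score computed independently by this site): 20.318178211934985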
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduced a pretraining strategy that combines self-supervised reconstructions, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Through evaluation using 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially showcasing few-shot, even zero-shot capabilities in real-world long-tail scenarios.
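Abstract (reference translation): Early detection of eye diseases such as glaucoma, macular degeneration, and diabetic retinopathy is essential for preventing vision loss. Artificial intelligence (AI) foundation models hold great promise for addressing these challenges, but existing ophthalmic foundation models focus mainly on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is exploiting the multi-view information across the various modalities of the same patient. In addition, because of the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. It is therefore essential to integrate clinical text in order to capture a broader range of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmic images with partial text data. To make full use of the large-scale multi-modal unlabeled and labeled data, we introduce a pretraining strategy that combines self-supervised reconstruction, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Evaluated on 14 benchmark datasets, EyeCLIP transfers to a wide range of downstream tasks involving ocular and systemic diseases and achieves state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advance over previous methods, notably demonstrating few-shot and even zero-shot capabilities in real-world long-tail scenarios.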
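
The abstract names image-text contrastive learning as one of the three pretraining objectives. As a rough illustration of that general idea only, the sketch below computes a symmetric CLIP-style contrastive loss over a small batch of paired embeddings in plain Python. It is not the authors' implementation: the function names (`l2_normalize`, `cross_entropy_diag`, `clip_style_loss`), the temperature value of 0.07, and the toy embeddings are all assumptions made for this example, and the reconstruction and image-image objectives mentioned in the abstract are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two image embeddings and two text embeddings.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch the loss is lower when each image embedding points in the same direction as its paired text embedding than when the pairs are swapped, which is the behavior a contrastive objective is meant to reward. The same function could in principle be applied to two image modalities from the same patient instead of an image and a text, though how EyeCLIP weights or combines its objectives is not stated in the abstract.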

Related papers

Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
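We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, delivering superior segmentation accuracy and robust generalizability.
Paper reference translation (metadata) (2025-08-07T03:41:41Z)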
A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models [28.34025112894094]
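This review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. The survey critically examines important datasets, evaluation metrics, and methodological innovations. It also discusses ongoing challenges, including data variability, limited annotations, lack of interpretability, and generalizability across diverse patient populations.
Paper reference translation (metadata) (2025-07-31T10:49:21Z)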
EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model [51.66031028717933]
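Medical Large Vision-Language Models (Med-LVLMs) show significant potential in healthcare. Intelligent ophthalmic diagnosis currently faces three major challenges: (i) data, (ii) benchmarks, and (iii) models. We propose the Eyecare Kit, which addresses these three challenges.
Paper reference translation (metadata) (2025-04-18T12:09:15Z)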
EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis [7.884451100342276]
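EyeDiff is a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts. EyeDiff is trained on eight large-scale datasets and adapted to ten multi-regional external datasets.
Paper reference translation (metadata) (2024-11-15T07:30:53Z)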
LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models [38.78576472811659]
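Large vision-language models (LVLMs) have the potential to help understand anatomical information, diagnose eye diseases, and draft interpretations and follow-up plans. We benchmarked 13 state-of-the-art LVLMs representing closed-source, open-source, and medical-domain models. The results show that LVLMs perform significantly worse in ophthalmology than in other domains.
Paper reference translation (metadata) (2024-10-02T14:57:58Z)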
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
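MVKL is the first multimodal mammography dataset comprising multi-view images, detailed manifestations, and reports. Based on this dataset, we focus on the challenging task of unsupervised pretraining. We propose ViKL, a framework that synergizes visual, knowledge, and linguistic features.
Paper reference translation (metadata) (2024-09-24T05:01:23Z)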
EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging [13.88319807760491]
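We propose EyeFound, a multimodal foundation model for ophthalmic images. It learns generalizable representations from unlabeled multimodal retinal images. It is trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities.
Paper reference translation (metadata) (2024-05-18T17:03:39Z)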
VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence [27.92420837559191]
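VisionFM is a foundation model pretrained on 3.4 million ophthalmic images from 560,457 individuals. After pretraining, VisionFM provides a foundation for fostering multiple ophthalmic artificial intelligence (AI) applications. VisionFM's generalist intelligence outperformed ophthalmologists with basic and intermediate levels of experience in jointly diagnosing 12 common ophthalmic diseases.
Paper reference translation (metadata) (2023-10-08T03:40:14Z)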
OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue [7.140551103766788]
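We introduce visual capabilities into a large language model to build an ophthalmology large language-and-vision assistant (OphGLM). Experimental results show that the OphGLM model performs very well and has the potential to revolutionize clinical applications in ophthalmology.
Paper reference translation (metadata) (2023-06-21T11:09:48Z)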
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
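LVM-Med is the first family of deep networks trained on large-scale medical datasets. We collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
Paper reference translation (metadata) (2023-06-20T22:21:34Z)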
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
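Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities. We explicitly account for prior images and reports when they are available, during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
Paper reference translation (metadata) (2023-01-11T16:35:33Z)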
GraVIS: Grouping Augmented Views from Independent Sources for Dermatology Analysis [52.04899592688968]
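We propose GraVIS, which is specifically optimized for learning self-supervised features from dermatology images. GraVIS significantly outperforms its transfer learning and self-supervised learning counterparts in both lesion segmentation and disease classification tasks.
Paper reference translation (metadata) (2023-01-11T11:38:37Z)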
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
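mPLUG is a new vision-language foundation model for cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding, and visual question answering.
Paper reference translation (metadata) (2022-05-24T11:52:06Z)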
An Interpretable Multiple-Instance Approach for the Detection of referable Diabetic Retinopathy from Fundus Images [72.94446225783697]
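We propose a machine learning system for detecting referable diabetic retinopathy in fundus images. By extracting local information from image patches and combining it efficiently through an attention mechanism, the system can achieve high classification accuracy. We evaluate our approach on publicly available retinal image datasets and show state-of-the-art performance.
Paper reference translation (metadata) (2021-03-02T13:14:15Z)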

The related-paper list is generated automatically from the titles and abstracts of papers on this site.

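Information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no liability for any results arising from the use of this site (including all information and translations).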