Fugu-MT Paper Translation (Abstract): Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays

Paper overview: Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays

arxiv url: http://arxiv.org/abs/2509.15234v1
Date: Wed, 17 Sep 2025 09:44:59 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:10.803046
Title: Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays
Title (reference translation): Exploring the Capabilities of LLM Encoders for Chest X-ray Image Retrieval
Authors: Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park
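Abstract summary: Vision-language pretraining has advanced image-text alignment, but progress in radiology is limited by the heterogeneity of clinical reports. We ask whether large language model (LLM) encoders can provide robust clinical representations across diverse styles. We introduce LLM2VEC4CXR, a domain-adapted encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone.
Reference score (independently computed attention measure): 8.019362739504087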
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness -- not scale alone -- is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.
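Abstract (reference translation): Vision-language pretraining has advanced image-text alignment, but progress in radiology is limited by the heterogeneity of clinical reports, such as abbreviations, impression-only notes, and stylistic variation. Unlike general-domain settings, where more data tends to improve performance, naively scaling to large collections of noisy reports can cause model learning to plateau or even degrade. We ask whether large language model (LLM) encoders can provide robust clinical representations that carry across diverse styles and improve image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR uses these embeddings to improve retrieval accuracy and clinically oriented scores, and generalizes across datasets better than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous, noisy reports, our models show that robustness, not scale alone, is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.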
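
Illustrative sketch (not from the paper): the abstract describes a dual-tower setup in which an image tower and a text tower each produce an embedding, and retrieval ranks reports by similarity to an image. The snippet below shows only that ranking step, under stated assumptions: cosine similarity as the score, recall@k as the metric, and small hand-written vectors standing in for real tower outputs. The function names and all values are hypothetical, and the paper's actual training objective and evaluation protocol may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image (rows) and every report (columns)."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]


def recall_at_k(sim, k):
    """Fraction of images whose paired report (same index) ranks in the top k."""
    hits = 0
    for idx, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if idx in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy stand-ins for tower outputs (assumption): in a real system these would
# come from the vision backbone and the LLM-based text encoder.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.3, 0.8]]

sim = similarity_matrix(image_embs, text_embs)
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(f"recall@1={r1:.3f} recall@2={r2:.3f}")
```

In this toy data only the first image ranks its own report first, while every image finds its report within the top two, so recall@2 exceeds recall@1. Text embeddings that separate similar reports more cleanly would move more pairs into the top rank.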
