Fugu-MT Paper Translation (Abstract): MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Paper summary: MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

arxiv url: http://arxiv.org/abs/2605.19359v1
Date: Tue, 19 May 2026 04:42:05 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.129197
Title: MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
Authors: Halil Ibrahim Gulluk, Olivier Gevaert
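Abstract summary: Deep learning methods have shown promising results in predicting BI-RADS scores from mammography images. The authors curated 2313 mammogram images and their corresponding captions from two mammography atlases. They fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining.
Reference score (independently computed attention score): 2.7579377082303673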
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
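The abstract says the model is trained on image-text pairs with contrastive learning but does not spell out the loss. The sketch below assumes a CLIP-style symmetric cross-entropy over cosine similarities, which the model name suggests but the abstract does not confirm. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions, not taken from the paper or its repository.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (falls back to 1.0 for an all-zero vector).
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def log_softmax_row(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Assumed CLIP-style objective: the i-th image and i-th caption form the
    # positive pair; every other pairing in the batch is a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix scaled by the (hypothetical) temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: softmax over each row.
    loss_i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy embeddings standing in for vision-encoder and text-encoder outputs.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_imgs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotate the images so that no image sits next to its own caption.
mismatched_imgs = [aligned_imgs[1], aligned_imgs[2], aligned_imgs[0]]

loss_aligned = clip_contrastive_loss(aligned_imgs, text_embs)
loss_mismatched = clip_contrastive_loss(mismatched_imgs, text_embs)
print(loss_aligned, loss_mismatched)
```

Under this assumed objective, the loss is low when each image embedding lies closest to its own caption and high when the pairing is scrambled, which is the signal that would push a vision encoder toward the information carried by the captions.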


Related papers