Fugu-MT Paper Translation (Summary): Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Paper summary: Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

arxiv url: http://arxiv.org/abs/2509.24739v1
Date: Mon, 29 Sep 2025 13:03:57 GMT
Status: Translation complete
System update date: 2025-09-30 22:32:19.994608
Title: Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
Authors: Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
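Abstract summary: We introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and 2,757 corresponding clinical reports. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering.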
Reference score (attention score computed independently by the site): 14.023732915879336
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, which limits their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and 2,757 corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

Related papers