Fugu-MT Paper Translation (Abstract): EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

Paper overview: EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

arxiv url: http://arxiv.org/abs/2509.14977v1
Date: Thu, 18 Sep 2025 14:07:53 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:53.255438
Title: EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Title (reference translation): EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Authors: Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
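Abstract summary: This paper proposes EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. Compared to Qwen2-VL on the ultrasound report generation task, EchoVLM improved BLEU-1 and ROUGE-1 scores by 10.15 and 4.77 points, respectively.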
Reference score (independently computed attention metric): 9.731550105507457
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
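Abstract (reference translation): Ultrasound imaging has become the preferred imaging modality for early cancer screening because it involves no ionizing radiation, is low in cost, and supports real-time imaging. However, conventional ultrasound diagnosis relies heavily on physician expertise, which leads to high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer a promising solution to this problem, but existing general-purpose models show limited knowledge of ultrasound medical tasks, including poor generalization in multi-organ lesion recognition and low efficiency in multi-task diagnosis. To address these limitations, we propose EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design allows the model to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question answering (VQA). In experiments, EchoVLM achieved significant improvements over Qwen2-VL on the ultrasound report generation task: 10.15 points in BLEU-1 and 4.77 points in ROUGE-1. These results suggest that EchoVLM has the potential to improve diagnostic accuracy in ultrasound imaging. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.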
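
The abstract reports its gains in BLEU-1 and ROUGE-1, both of which measure unigram overlap between a generated report and a reference report. The sketch below is a minimal, simplified illustration of those two metrics. It assumes whitespace tokenization, a single reference, and scores reported on a 0-100 scale; the paper's actual tokenizer and evaluation toolkit are not stated in this abstract. The function names and the example report fragments are hypothetical.

```python
import math
from collections import Counter


def unigram_overlap(candidate, reference):
    """Count unigrams shared by both token lists, clipped by reference counts."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    return sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())


def bleu1(candidate, reference):
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0
    precision = unigram_overlap(candidate, reference) / len(candidate)
    # The brevity penalty lowers the score of candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    if not candidate or not reference:
        return 0.0
    overlap = unigram_overlap(candidate, reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical report fragments, whitespace-tokenized for illustration only.
reference = "hypoechoic nodule in the right thyroid lobe".split()
candidate = "hypoechoic nodule in the left thyroid lobe".split()

print(round(100 * bleu1(candidate, reference), 2))
print(round(100 * rouge1_f1(candidate, reference), 2))
```

In this toy pair, six of the seven candidate tokens appear in the reference, so both metrics score highly even though the one differing word ("left" versus "right") changes the clinical meaning. That is one reason overlap metrics are usually read alongside task-level evaluation such as diagnosis or VQA accuracy.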

Related papers