Fugu-MT Paper Translation (Abstract): Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Paper overview: Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

arxiv url: http://arxiv.org/abs/2604.02207v1
Date: Thu, 02 Apr 2026 15:59:40 GMT
Status: Translation complete
System update date: 2026-04-03 14:21:10.90683
Title: Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Title (reference translation): Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Authors: Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
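Abstract summary: The study analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Reference score (independently computed attention metric): 0.0177677587528917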
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
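Abstract (reference translation): Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and to compare radiologist assessments with LLM-as-a-judge evaluations. Methods: 150 chest CT reports from the CT-RATE-JPN validation set were analyzed. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between the radiologists and the LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed a strong preference for LLM output and negligible agreement with the radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.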
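The two agreement measures named in the Methods, QWK (quadratic weighted kappa) and percentage agreement, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes, purely for illustration, that each pairwise judgment is coded on a 3-level ordinal scale (0 = human-edited translation preferred, 1 = equivalent, 2 = LLM translation preferred); the abstract does not state the coding. The function names and the sample ratings are hypothetical.

```python
def percentage_agreement(rater_a, rater_b):
    """Fraction of items on which both raters chose the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights.

    Ratings are integers in range(n_categories), treated as ordinal.
    Returns NaN when chance-expected disagreement is zero (kappa undefined).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least 2 categories")
    n = len(rater_a)
    k = n_categories

    # Observed confusion matrix: rows = rater A, columns = rater B.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(rater_a, rater_b):
        observed[x][y] += 1

    # Marginal counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at the extremes
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings for six report pairs (illustrative values, not study data).
radiologist = [0, 1, 2, 2, 1, 0]
llm_judge = [2, 2, 2, 2, 1, 2]
print("percentage agreement:", percentage_agreement(radiologist, llm_judge))
print("QWK:", quadratic_weighted_kappa(radiologist, llm_judge, 3))
```

Unlike raw percentage agreement, QWK corrects for agreement expected by chance and penalizes a disagreement by the squared distance between the two categories, so values near zero indicate agreement no better than chance even when raw percentage agreement looks moderate.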
