Fugu-MT Paper Translation (Abstract): Quantifying Hallucinations in Language Language Models on Medical Textbooks

Paper summary: Quantifying Hallucinations in Language Language Models on Medical Textbooks

arxiv url: http://arxiv.org/abs/2603.09986v1
Date: Thu, 12 Feb 2026 16:16:35 GMT
Status: Translation complete
Last updated in system: 2026-03-15 16:38:22.527296
Title: Quantifying Hallucinations in Language Language Models on Medical Textbooks
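Title (reference translation): Quantifying Hallucinations in Language Models on Medical Textbooks
Authors: Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman
Abstract summary: We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. In experiment one, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7).
Reference score (independently computed attention score): 5.868116026339879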
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Hallucination, the tendency of large language models to provide responses containing factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective mitigation. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first determines the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second determines the prevalence of hallucinations and clinician preference for model responses. In experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility. In experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and 2, respectively.
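Abstract (reference translation): Hallucination, the tendency of large language models to answer with factually incorrect claims, is a serious problem in natural language processing for which there is not yet an effective solution. Existing QA benchmarks rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur in textbook-grounded QA and whether responses to QA differ across models. We conducted two experiments: the first to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second to determine the prevalence of hallucinations and clinician preference for model responses. In experiment one, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility; in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and 2, respectively.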

Related papers