Fugu-MT Paper Translation (Abstract): Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

Paper abstract: Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

arxiv url: http://arxiv.org/abs/2606.01301v1
Date: Sun, 31 May 2026 15:43:42 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.563657
Title: Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning
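Title (reference translation): Med-HEAL: Analyzing and mitigating hallucinations in hallucination-aware medical LLMs
Authors: Yiming Liao, Zeno Franco, Jose Eduardo Lizarraga Mazaba, Keke Chen
Abstract summary: Hallucinations in medical large language models pose serious risks for clinical decision support. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs.
Reference score (independently computed attention level): 8.322191814123315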
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, particularly when models must reason over complex electronic health records (EHRs). However, existing benchmarks often lack a realistic clinical context and provide limited insight into how hallucinations can be mitigated in practice. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Building on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we construct a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical student reviewers, producing correctness judgments and annotations of reasoning errors via a custom web-based evaluation system. We then leverage this dataset to investigate mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments across five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3) show that the self-critique strategy improves accuracy for three of five models (p < 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are publicly available at https://github.com/yimingliao-blad/med-heal.git.
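The self-critique pipeline described in the abstract (answer, self-review, regenerate flagged cases) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the `generate` callable, the prompt wording, and the `FLAG`/`OK` reply convention are all hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Dict

# Assumption: `generate` is any function that maps a prompt string to the
# model's text response (e.g. a thin wrapper around a local LLM).
Generate = Callable[[str], str]


def self_critique(generate: Generate, note: str, question: str) -> Dict[str, object]:
    """Answer a question, let the same model review it, regenerate if flagged."""
    question_prompt = f"Discharge summary:\n{note}\n\nQuestion: {question}\n"

    # Step 1: initial answer from the test model.
    answer = generate(question_prompt + "Answer:")

    # Step 2: the same model reviews its own answer (hypothetical prompt).
    critique = generate(
        question_prompt
        + f"Proposed answer: {answer}\n"
        + "Check the proposed answer against the summary. "
        + "Reply 'FLAG: <reason>' if it contains an error, otherwise 'OK'."
    ).strip()
    flagged = critique.upper().startswith("FLAG")

    # Step 3: regenerate only for flagged cases; no parameters are updated.
    if flagged:
        answer = generate(
            question_prompt
            + f"A previous answer was rejected ({critique}).\n"
            + "Revised answer:"
        )
    return {"answer": answer, "flagged": flagged, "critique": critique}
```

Because the model is passed in as a plain callable, the same loop can be run against each of the evaluated models without changing the pipeline code.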
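Abstract (reference translation): Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, especially when models must reason over complex electronic health records (EHRs). However, existing benchmarks lack a realistic clinical context and offer only limited insight into how hallucinations can be mitigated in practice. Med-HEAL is a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Based on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we build a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical students, producing correctness judgments and annotations of reasoning errors through a custom web-based evaluation system. We then use this dataset to investigate two mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments on five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3) show that the self-critique strategy improves accuracy for three of the five models (p < 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are available at https://github.com/yimingliao-blad/med-heal.git.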
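Retrieval-augmented in-context learning (RA-ICL), as the abstract describes it, shows the model hallucinated answers alongside their corrections. A minimal sketch of that idea follows. The abstract does not say how examples are retrieved or formatted, so the token-overlap (Jaccard) retriever, the `HallucinationExample` record, and the prompt layout here are assumptions chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HallucinationExample:
    """Hypothetical record: one labeled hallucination and its correction."""
    question: str
    hallucinated: str
    corrected: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever retriever is used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, pool: List[HallucinationExample], k: int = 2) -> List[HallucinationExample]:
    """Return the k pool examples whose questions are most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex.question), reverse=True)
    return ranked[:k]


def build_prompt(note: str, question: str, examples: List[HallucinationExample]) -> str:
    """Prepend hallucinated/corrected pairs as in-context demonstrations."""
    shots = "\n\n".join(
        f"Question: {ex.question}\n"
        f"Incorrect answer: {ex.hallucinated}\n"
        f"Corrected answer: {ex.corrected}"
        for ex in examples
    )
    return f"{shots}\n\nDischarge summary:\n{note}\n\nQuestion: {question}\nAnswer:"
```

The resulting prompt string can be passed to any model unchanged, which is consistent with the abstract's point that mitigation works without parameter updates.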

Related papers