Fugu-MT paper translation (summary): ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

arxiv url: http://arxiv.org/abs/2606.14697v1
Date: Fri, 12 Jun 2026 17:58:38 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:43.029163
Title: ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
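Title (reference translation): ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
Authors: Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu
Abstract summary: We introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We show that trace-supervised fine-tuning reduces stage-wise hallucinations.
Reference score (independently computed attention score): 37.1442121485284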
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.
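Abstract (reference translation): Building trustworthy medical multimodal large language models (MLLMs) is essential for reliable clinical decision support. Existing medical hallucination benchmarks focus mainly on data collection and often ignore where in the reasoning process a hallucination originates. Hallucinations may arise from visual misrecognition, incorrect recall of medical knowledge, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting a specific stage affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.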
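The stage-replacement idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Trace` class, the field names, `replace_stage`, `intervention_effects`, and the toy answer function are all hypothetical, and the abstract does not say how downstream stages are handled after a replacement. The sketch only shows the general pattern of swapping one stage of a model's trace for a reference stage and checking whether the final answer changes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple

# Hypothetical field names for the three stages named in the abstract.
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


@dataclass(frozen=True)
class Trace:
    """A structured reasoning trace with one text field per stage (assumed layout)."""
    visual_recognition: str
    knowledge_recall: str
    reasoning_integration: str


def replace_stage(model_trace: Trace, reference_trace: Trace, stage: str) -> Trace:
    """Return a copy of model_trace with one stage taken from reference_trace."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return replace(model_trace, **{stage: getattr(reference_trace, stage)})


def intervention_effects(
    model_trace: Trace,
    reference_trace: Trace,
    answer_fn: Callable[[Trace], str],
    gold: str,
) -> Dict[str, Tuple[bool, bool]]:
    """For each stage, report (correct before replacement, correct after replacement)."""
    baseline = answer_fn(model_trace) == gold
    return {
        stage: (baseline, answer_fn(replace_stage(model_trace, reference_trace, stage)) == gold)
        for stage in STAGES
    }


# Toy stand-in for a model: the answer depends only on the visual stage.
def toy_answer(trace: Trace) -> str:
    return "pneumonia" if "consolidation" in trace.visual_recognition else "normal"


reference = Trace(
    "right lower lobe consolidation",
    "consolidation suggests pneumonia",
    "findings support pneumonia",
)
model = Trace(
    "clear lung fields",  # visual misrecognition
    "consolidation suggests pneumonia",
    "no abnormality, so normal",
)

effects = intervention_effects(model, reference, toy_answer, gold="pneumonia")
for stage, (before, after) in effects.items():
    print(f"{stage}: correct before={before}, after={after}")
```

In this toy case only the Visual Recognition replacement turns the answer from wrong to right, which is the kind of signal that would point to the visual stage as the source of the error.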

Related papers

MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models [81.64135119165277]
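Hallucinations can impair clinical decision-making and may harm diagnosis and treatment. We propose MedHallTune, a large-scale benchmark designed to evaluate and mitigate hallucinations in medical VLMs. Using MedHallTune, we conduct a comprehensive evaluation of current medical and general VLMs, assessing their performance on key metrics including clinical accuracy, relevance, detail level, and risk level.
Paper reference translation (metadata) (2025-02-28T06:59:49Z)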
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
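Large vision language models (LVLMs) are becoming increasingly integral to healthcare applications. LVLMs inherit a susceptibility to hallucinations. We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation. We also present MedHallDetector, a medical LVLM for precise hallucination detection.
Paper reference translation (metadata) (2024-06-14T17:14:22Z)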
HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation [19.318217051269382]
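Large language models (LLMs) have significantly advanced the field of natural language processing (NLP). HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. The benchmark contains 4,094 dialogues with a total of 146,856 samples.
Paper reference translation (metadata) (2024-06-11T08:56:18Z)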
Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment [52.43197107069751]
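Multimodal large language models (MLLMs) often generate factually inaccurate information, referred to as hallucination. We propose Data-augmented Phrase-level Alignment (DPA), a novel loss that can be applied to instruction-tuned MLLMs to mitigate hallucinations.
Paper reference translation (metadata) (2024-05-28T23:36:00Z)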
DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models [26.289847386286446]
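We propose DiaHalu, a dialogue-level hallucination evaluation benchmark. We integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5 instances. We manually modify content that does not follow human language conventions and then have the LLMs regenerate it, simulating human-machine interaction scenarios.
Paper reference translation (metadata) (2024-03-01T15:38:55Z)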
The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
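Hallucination poses a major challenge to the trustworthy deployment of large language models. Three key questions should be well studied: how to detect hallucinations (detection), why LLMs hallucinate (source), and what can be done to mitigate them (mitigation). This work is a systematic empirical study of LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation.
Paper reference translation (metadata) (2024-01-06T12:40:45Z)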
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
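Hallucinations inherent in machine-generated data remain under-explored. We propose HalluciDoctor, a novel hallucination detection and elimination framework based on the cross-checking paradigm. Compared with LLaVA, it mitigates 44.6% of hallucinations while maintaining competitive performance.
Paper reference translation (metadata) (2023-11-22T04:52:58Z)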

The related papers list is generated automatically from the titles and abstracts of papers on this site.
