Fugu-MT paper translation (abstract): MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Paper overview: MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

arxiv url: http://arxiv.org/abs/2605.07919v1
Date: Fri, 08 May 2026 15:55:30 GMT
Status: translation complete
Last updated in system: 2026-05-11 19:43:39.180054
Title: MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Title (reference translation): MedVIGIL: Evaluating Trustworthy Medical VLMs
Authors: Hanqi Jiang, Junhao Chen, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li
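Abstract summary: MedVIGIL is a 300-case evaluation suite drawn from four public medical VQA sources. Every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. The paper reports seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS). The independent radiologist scores MCS 83.3 at a silent-failure rate of 5.8%, leaving a 14.1-point composite headroom above the strongest audited model.
Reference score (independently computed attention measure): 24.517280048376758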
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
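Abstract (reference translation): Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property. The setting studied pairs a vision-required medical question with a false premise, a wording perturbation, a knowledge-only rewrite, or an ROI-corrupted image, and the model nevertheless returns a non-refusal answer. MedVIGIL is a 300-case evaluation suite built from four public medical VQA sources, in which every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist, independent of construction, answers every probe to provide the human reference baseline. The release includes 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. Seven correctness-conditioned audit metrics are reported and summarised into the MedVIGIL Composite Score (MCS), and 16 vision-capable models plus two text-only baselines are audited. The independent radiologist scores MCS 83.3 at a silent-failure rate of 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.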
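
The silent-failure idea in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation harness, and the abstract does not give the exact metric definitions (the reported metrics are correctness-conditioned, which this sketch does not model). The `ProbeResult` type, its field names, and both functions are hypothetical. The assumed reading is that a silent failure is a probe whose evidence has been broken, so refusal is the expected behaviour, and the model answers anyway.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    # Hypothetical record for one probe; the field names are assumptions.
    evidence_broken: bool  # the evidential basis failed, so refusal is expected
    refused: bool          # the model chose the refusal option


def silent_failure_rate(results):
    """Share of broken-evidence probes answered without a refusal."""
    broken = [r for r in results if r.evidence_broken]
    if not broken:
        return 0.0
    silent = sum(1 for r in broken if not r.refused)
    return silent / len(broken)


def composite_headroom(human_mcs, model_mcs):
    """Gap between the human reference score and a model score, to one decimal."""
    return round(human_mcs - model_mcs, 1)


if __name__ == "__main__":
    demo = [
        ProbeResult(evidence_broken=True, refused=True),
        ProbeResult(evidence_broken=True, refused=True),
        ProbeResult(evidence_broken=True, refused=True),
        ProbeResult(evidence_broken=True, refused=False),  # silent failure
        ProbeResult(evidence_broken=False, refused=False),  # intact probe, ignored
    ]
    print(silent_failure_rate(demo))
    # Figures quoted in the abstract: radiologist 83.3, strongest model 69.2.
    print(composite_headroom(83.3, 69.2))
```

With the abstract's figures, the headroom calculation reproduces the stated 14.1-point gap between the radiologist (83.3) and the strongest audited model (69.2).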


Related papers