Fugu-MT Paper Translation (Abstract): CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

Paper Summary: CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

arxiv url: http://arxiv.org/abs/2605.05810v1
Date: Thu, 07 May 2026 07:46:17 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.604199
Title: CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Authors: Zhengru Fang, Yanan Ma, Yu Guo, Senkang Hu, Yixian Zhang, Hangcheng Cao, Wenbo Ding, Yuguang Fang
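Abstract summary: CXR-ContraBench is a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The authors study negated-option attraction, a failure in which a model is drawn to a negated answer option even when that option conflicts with both the visual evidence and the question.
Reference score (attention metric computed independently by the site): 20.96410413299322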
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting "No X" despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.
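
The two headline measurements in the abstract, accuracy on presence questions and the rate at which a model picks a negated option on those questions, can be illustrated with a small sketch. This is not the authors' evaluation code: the `Record` fields, the `"presence"`/`"absence"` labels, the toy data, and the rule that a negated option starts with "No " (modeled on the "No X" wording quoted above) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    question_type: str  # "presence" or "absence" (hypothetical labels)
    chosen: str         # option text the model selected
    gold: str           # correct option text


def is_negated(option: str) -> bool:
    # Assumption: negated options follow the "No X" pattern from the abstract.
    return option.strip().lower().startswith("no ")


def presence_metrics(records):
    """Accuracy and negated-option selection rate on presence questions only."""
    presence = [r for r in records if r.question_type == "presence"]
    if not presence:
        return {"accuracy": 0.0, "negated_rate": 0.0}
    n = len(presence)
    correct = sum(r.chosen == r.gold for r in presence)
    negated = sum(is_negated(r.chosen) for r in presence)
    return {"accuracy": correct / n, "negated_rate": negated / n}


# Toy records, invented for this example (not benchmark data).
toy = [
    Record("presence", "No consolidation", "Consolidation"),
    Record("presence", "Consolidation", "Consolidation"),
    Record("presence", "No edema", "Edema"),
    Record("absence", "No effusion", "No effusion"),
]
metrics = presence_metrics(toy)
print(metrics)
```

Reporting the negated-option rate alongside accuracy separates polarity reversals from other wrong answers, which is the distinction the abstract draws when it says standard accuracy can hide this failure.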
