Fugu-MT Paper Translation (Summary): HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Paper summary: HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

arxiv url: http://arxiv.org/abs/2606.00971v1
Date: Sun, 31 May 2026 03:02:05 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:29.009449
Title: HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering
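Title (reference translation): Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering
Authors: Md Motaleb Hossen Manik, Ge Wang
Abstract summary: This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical question answering. It combines direct, chain-of-thought, and HypothesisMed-v3 prompting with answer fusion. Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B are evaluated on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset.
Reference score (independently calculated attention level): 6.396911723204044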
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
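The abstract does not state the fusion rule or how coverage is computed, so the following is only a minimal sketch under stated assumptions: fusion is taken to be a majority vote over the three prompting modes, ties are broken in the (hypothetical) priority order HypothesisMed-v3, chain-of-thought, direct, and parse and SPACE coverage are taken to be the fraction of examples with a usable answer or a recognized SPACE label. All names (`Example`, `fuse`, `summarize`) are illustrative and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# The three SPACE labels named in the abstract.
SPACE_LABELS = {"VALID", "INCOMPLETE", "CONTRADICTED"}


@dataclass
class Example:
    """One multiple-choice question with per-mode parsed outputs (hypothetical layout)."""
    gold: str                   # gold option letter
    direct: Optional[str]       # parsed answer from direct prompting, None if unparseable
    cot: Optional[str]          # parsed answer from chain-of-thought prompting
    structured: Optional[str]   # parsed answer from the HypothesisMed-v3 style prompt
    space: Optional[str]        # SPACE label reported by the structured prompt


def fuse(ex: Example) -> Optional[str]:
    """Majority vote over parseable answers (assumed rule, not the paper's).

    Ties go to the first answer in the assumed priority order
    structured > cot > direct. Returns None if nothing parsed.
    """
    votes = [a for a in (ex.structured, ex.cot, ex.direct) if a is not None]
    if not votes:
        return None
    counts = Counter(votes)
    best = max(counts.values())
    for answer in votes:  # votes are already in priority order
        if counts[answer] == best:
            return answer
    return None  # unreachable, kept for type checkers


def summarize(examples: list[Example]) -> dict[str, float]:
    """Accuracy, parse coverage, and SPACE coverage of the fused output."""
    n = len(examples)
    fused = [fuse(ex) for ex in examples]
    return {
        "accuracy": sum(f == ex.gold for f, ex in zip(fused, examples)) / n,
        "parse_coverage": sum(f is not None for f in fused) / n,
        "space_coverage": sum(ex.space in SPACE_LABELS for ex in examples) / n,
    }


# Toy data invented for illustration only; these are not results from the paper.
toy = [
    Example("A", "A", "B", "A", "VALID"),
    Example("C", None, "C", None, "INCOMPLETE"),
    Example("B", "A", "B", "D", None),
    Example("D", None, None, None, "MAYBE"),
]
print(summarize(toy))
```

Reporting the three numbers separately mirrors the abstract's point that answer accuracy, parseability, and structured reliability reporting are separable: a fused output can gain coverage without gaining accuracy, and vice versa.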

