Fugu-MT Paper Translation (Summary): MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Paper summary: MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

arxiv url: http://arxiv.org/abs/2605.12361v1
Date: Tue, 12 May 2026 16:32:43 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.013545
Title: MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Title (reference translation): MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Authors: Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu
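Abstract summary: MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs. Each question requires synthesizing information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format.
Reference score (attention score computed independently by this site): 11.3883842897598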
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
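Abstract (reference translation): Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that distinguish reasoning from pattern matching and stay discriminative as model capabilities improve. Existing biomedical QA benchmarks are limited in this respect. Multiple-choice formats can let models succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet it remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs, introduced as a shared task at BioCreative IX. Each question requires synthesizing information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets.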
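The abstract mentions two evaluation ideas: gold answers extended with synonym sets, and a scored subset hidden inside a larger public question pool. The following is a minimal sketch of how such scoring could look, not the benchmark's actual scorer. The function names, the normalization rule, the question IDs, and the example answers and synonyms are all assumptions made for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_match(prediction: str, gold: str, synonyms: set[str]) -> bool:
    """True if the prediction equals the gold answer or any listed synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


def score(predictions: dict[str, str],
          gold: dict[str, tuple[str, set[str]]]) -> float:
    """Accuracy over the scored question IDs only; all other IDs are ignored."""
    if not gold:
        return 0.0
    hits = sum(
        is_match(predictions.get(qid, ""), answer, synonyms)
        for qid, (answer, synonyms) in gold.items()
    )
    return hits / len(gold)


# Hypothetical data: only q1 and q3 are scored; q2 is a decoy in the public pool.
gold = {
    "q1": ("Marfan syndrome", {"MFS", "Marfan's syndrome"}),
    "q3": ("FBN1", {"fibrillin 1"}),
}
predictions = {"q1": "marfan's syndrome", "q2": "anything", "q3": "TP53"}
print(score(predictions, gold))  # 0.5
```

Because a submission covers every public question while only the hidden IDs count, a participant cannot tell which answers affect the score. String-level synonym matching is only a stand-in for concept-level evaluation, which would normally compare ontology identifiers.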

Related papers

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering [8.26744997684193]
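The BioCreative IX MedHopQA shared task was designed to benchmark large language models (LLMs) on multi-hop reasoning. We developed a new dataset of 1,000 QA pairs spanning diseases, genes, and chemicals. Each question was constructed to require two-hop reasoning by integrating information from two Wikipedia pages.
Paper reference translation (metadata) (2026-05-12T15:59:28Z)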
BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models [7.8780007697387235]
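This paper introduces BioPulse-QA, a benchmark that evaluates large language models (LLMs) on answering questions from newly published biomedical documents. We evaluate four LLMs, among them GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B.
Paper reference translation (metadata) (2026-01-19T00:38:33Z)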
UETQuintet at BioCreative IX - MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval [0.0]
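We propose a model that effectively handles both direct and sequential questions. It uses multi-source information retrieval and in-context learning to provide rich, relevant context for answer generation. Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard.
Paper reference translation (metadata) (2026-01-11T16:12:38Z)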
MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI [66.0701326117134]
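MedForget is a hierarchy-aware multimodal unlearning testbed for building compliant medical AI systems. We show that existing methods struggle to achieve complete, hierarchy-aware forgetting without degrading diagnostic performance. We introduce a reconstruction attack that progressively adds hierarchy-level context to prompts.
Paper reference translation (metadata) (2025-12-10T17:55:06Z)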
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive [50.468138755368805]
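The opioid crisis is a pivotal moment for public health. The work draws on the data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). This paper addresses the challenge by organizing the original dataset according to document attributes.
Paper reference translation (metadata) (2025-11-13T03:27:32Z)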
CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA [3.222047196930981]
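Large language models (LLMs) are increasingly prominent in accurate question answering across a range of domains. This paper describes our approach to the MedHopQA track of the BioCreative IX shared task. Three experimental setups are explored: fine-tuning on combined short and long answers, on short answers only, and on long answers only.
Paper reference translation (metadata) (2025-08-31T11:40:02Z)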
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use [10.565661515629412]
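MedBrowseComp is a benchmark that systematically tests an agent's ability to retrieve and synthesize medical facts. It contains more than 1,000 human-curated questions that reflect clinical scenarios. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls, with scores as low as 10%.
Paper reference translation (metadata) (2025-05-20T22:42:33Z)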
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
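MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking state-of-the-art MLLMs shows a peak performance of 53%. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
Paper reference translation (metadata) (2025-03-17T17:33:10Z)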
ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
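Question answering (QA) effectively evaluates the reasoning ability and depth of knowledge of language models. Chemical QA plays an important role in both education and research by translating complex chemical information into an accessible format. The dataset reflects typical real-world challenges, including imbalanced data distribution and a substantial amount of unlabeled data that may be useful. We propose the QAMatch model, which makes full use of the collected data to answer chemical questions effectively.
Paper reference translation (metadata) (2024-07-24T01:46:55Z)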
Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
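The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization examination. These questions are the most challenging for current QA systems. We propose a Multi-step reasoning with Knowledge extraction framework (MurKe). We strive to make full use of off-the-shelf pre-trained models.
Paper reference translation (metadata) (2020-08-06T02:47:46Z)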

The related-paper list is generated automatically from the titles and abstracts of papers on this site.

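This page shows information about the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use.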