Fugu-MT Paper Translation (Abstract): BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

Paper abstract: BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

arxiv url: http://arxiv.org/abs/2604.26048v1
Date: Tue, 28 Apr 2026 18:33:21 GMT
Status: Translation complete
System update date: 2026-04-30 15:59:36.14402
Title: BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
Title (reference translation): BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
Authors: Richard A. A. Jonker, Bárbara Maria Ribeiro de Abreu Martins, Sérgio Matos
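Abstract summary: This paper proposes a principled framework for generating complex Question Answering (QA) data. At the core of the framework is a graphlet-anchored generation process in which small subgraphs from a Knowledge Graph (KG) are used in a structured prompt. The first instantiation of the framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs.
Reference score (independently computed attention score): 0.3058685580689604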
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, in which small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, and most pairs are enriched with relevant document snippets from PubMed. First, we demonstrate the framework's value and the dataset's quality through a domain expert's evaluation of 106 QA pairs, which confirms the high scientific validity and complexity of the generated data. Second, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
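Abstract (reference translation): This paper proposes a principled and scalable framework for systematically generating complex Question Answering (QA) data. The core of the framework is a graphlet-anchored generation process that uses small subgraphs from a Knowledge Graph (KG) in a structured prompt to control complexity and to ensure the factual grounding of questions generated by Large Language Models. The first instantiation of the framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, and most pairs are enriched with relevant document snippets from PubMed. First, the framework's value and the dataset's quality are demonstrated through a domain expert's evaluation of 106 QA pairs, which confirms the scientific validity and complexity of the generated data. Second, augmenting downstream benchmarks with the data is shown to improve accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from 41.4% to 44.8% in a full-resource setting. The framework provides a robust and generalizable solution for creating resources that are critical to advancing complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and the framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.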
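
The graphlet-anchored generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the GitHub URL above); it assumes a toy KG stored as (head, relation, tail) triples, a simple random edge-expansion sampler, and an invented prompt layout. The entity names, the `sample_graphlet` and `build_prompt` functions, and the prompt wording are all hypothetical. Only the cap of five nodes per graphlet comes from the abstract.

```python
import random

# Toy knowledge graph as (head, relation, tail) triples.
# All entities and relations are invented placeholders, not OREGANO data.
TRIPLES = [
    ("drug_A", "targets", "protein_X"),
    ("protein_X", "involved_in", "pathway_P"),
    ("drug_A", "treats", "disease_D"),
    ("disease_D", "associated_with", "gene_G"),
    ("gene_G", "encodes", "protein_X"),
    ("drug_B", "treats", "disease_D"),
]


def sample_graphlet(triples, max_nodes=5, seed=None):
    """Grow a connected subgraph by random edge expansion, capped at max_nodes nodes."""
    rng = random.Random(seed)
    all_nodes = sorted({t[0] for t in triples} | {t[2] for t in triples})
    nodes = {rng.choice(all_nodes)}
    edges = []
    while True:
        # Candidate edges: unused, touching the current node set,
        # and not pushing the node count past the budget.
        frontier = [
            t for t in triples
            if t not in edges
            and (t[0] in nodes or t[2] in nodes)
            and len(nodes | {t[0], t[2]}) <= max_nodes
        ]
        if not frontier:
            return sorted(nodes), edges
        head, rel, tail = rng.choice(frontier)
        edges.append((head, rel, tail))
        nodes.update((head, tail))


def build_prompt(nodes, edges):
    """Render the graphlet as a structured prompt (the layout is an assumption)."""
    lines = ["Entities:"]
    lines += [f"- {n}" for n in nodes]
    lines.append("Relations:")
    lines += [f"- {h} --{r}--> {t}" for h, r, t in edges]
    lines.append(
        "Task: write one question that requires combining the relations above, "
        "and an answer supported only by these facts."
    )
    return "\n".join(lines)


nodes, edges = sample_graphlet(TRIPLES, max_nodes=5, seed=0)
prompt = build_prompt(nodes, edges)
print(prompt)
```

Restricting the prompt to the sampled triples is what would keep a generated question tied to facts in the KG, while the node budget bounds how many facts the question can combine.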

Related papers