Related papers: Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

URL: http://arxiv.org/abs/2505.23495v4
Date: Mon, 03 Nov 2025 19:15:16 GMT
Title: Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Authors: Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma,
Abstract summary: Knowledge Graph Question Answering systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning.<n>Despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues.<n>We introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls.<n>Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
Score: 63.84117489519164
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

Related papers

Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level.<n>We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z)
Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs [3.222543736797976]
SynthKGQA is a framework for generating high-quality synthetic Knowledge Graph Question Answering datasets from any Knowledge Graph.<n>We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models.
arXiv Detail & Related papers (2025-11-06T15:45:18Z)
Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching [61.824094419641575]
Large Language Models (LLMs) struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA)<n>We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures.<n>Existing methods usually employ resource-intensive, non-scalable reasoning on vanilla KGs, but overlook this gap.<n>We propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs, bridge the semantic gap between graphs and queries.
arXiv Detail & Related papers (2025-09-25T06:48:52Z)
KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval [5.263064605350636]
We propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR)<n> KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts.<n> Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models.
arXiv Detail & Related papers (2025-08-28T04:37:15Z)
KG2QA: Knowledge Graph-enhanced Retrieval-augmented Generation for Communication Standards Question Answering [7.079181644378029]
KG2QA is a question answering framework that integrates fine-tuned large language models (LLMs) with a domain-specific knowledge graph (KG)<n>We construct a high-quality dataset of 6,587 QA pairs from ITU-T recommendations and fine-tune Qwen2.5-7B-Instruct.<n>In our KG-RAG pipeline, the fine-tuned LLMs first retrieves relevant knowledge from KG, enabling more accurate and factually grounded responses.
arXiv Detail & Related papers (2025-06-08T08:07:22Z)
Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models [51.47994645529258]
We propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance.<n> Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.
arXiv Detail & Related papers (2025-03-30T17:09:11Z)
MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation [5.525151548786079]
Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations.<n>We propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries.
arXiv Detail & Related papers (2025-03-29T06:26:01Z)
Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service [9.468878976626351]
KGQA systems are complex because the system has to understand the relations and entities in the knowledge-seeking natural language queries.<n>We introduce Chronos, a comprehensive evaluation framework for KGQA at industry scale.<n>It is designed to evaluate such a multi-component system comprehensively, focusing on (1) end-to-end and component-level metrics, (2) scalable to diverse datasets and (3) a scalable approach to measure the performance of the system prior to release.
arXiv Detail & Related papers (2025-01-28T20:02:10Z)
A Universal Question-Answering Platform for Knowledge Graphs [7.2676028986202]
We propose KGQAn, a universal QA system that does not need to be tailored to each target KG. KGQAn is easily deployed and outperforms by a large margin the state-of-the-art in terms of quality of answers and processing time.
arXiv Detail & Related papers (2023-03-01T15:35:32Z)
KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models [76.01814380927507]
KGxBoard is an interactive framework for performing fine-grained evaluation on meaningful subsets of the data. In our experiments, we highlight the findings with the use of KGxBoard, which would have been impossible to detect with standard averaged single-score metrics.
arXiv Detail & Related papers (2022-08-23T15:11:45Z)
Knowledge Base Question Answering by Case-based Reasoning over Subgraphs [81.22050011503933]
We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms. The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
arXiv Detail & Related papers (2022-02-22T01:34:35Z)
Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis [61.740077541531726]
We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community. Our analysis highlights existing problems during the evaluation of KGQA systems.
arXiv Detail & Related papers (2022-01-20T13:46:01Z)
QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering [122.84513233992422]
We propose a new model, QA-GNN, which addresses the problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) We show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning.
arXiv Detail & Related papers (2021-04-13T17:32:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.