GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation
- URL: http://arxiv.org/abs/2508.16994v1
- Date: Sat, 23 Aug 2025 11:26:41 GMT
- Title: GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation
- Authors: Jeongsoo Lee, Daeyong Kwon, Kyohoon Jin,
- Abstract summary: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks.<n>Current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios.<n>We propose textscGRADE, a novel evaluation framework that models task difficulty along two dimensions.
- Score: 5.002953635224383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose \textsc{GRADE}, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. \textsc{GRADE} enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.
Related papers
- HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG [53.30561659838455]
Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations.<n>Retrieval-Augmented Generation (RAG) frequently overlooks structural interdependencies essential for multi-hop reasoning.<n>Help achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$times$ speedup over leading Graph-based RAG baselines.
arXiv Detail & Related papers (2026-02-24T14:05:29Z) - Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering [14.456873356080186]
Reasoning Tree Guided RAG (RT-RAG) is a novel hierarchical framework for complex multi-hop QA.<n>RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition.
arXiv Detail & Related papers (2026-01-16T13:02:25Z) - Multi-hop Reasoning via Early Knowledge Alignment [68.28168992785896]
Early Knowledge Alignment (EKA) aims to align Large Language Models with contextually relevant retrieved knowledge.<n>EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency.<n>EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models.
arXiv Detail & Related papers (2025-12-23T08:14:44Z) - BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data [8.52473384574856]
We present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources.<n>The system grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion.
arXiv Detail & Related papers (2025-10-28T07:43:15Z) - Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox.<n>Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures.<n>We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z) - LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval [10.566901995776025]
LeanRAG is a framework that combines knowledge aggregation and retrieval strategies.<n>It can mitigate the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval.
arXiv Detail & Related papers (2025-08-14T06:47:18Z) - MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation [5.243680711038957]
MultiRAG is a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation.<n>It employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources.<n>It implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes.
arXiv Detail & Related papers (2025-08-05T15:20:52Z) - Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge.<n>It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts.<n>This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z) - Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains.<n>Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning.<n>Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering.<n>We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z) - Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents [40.37993572657772]
We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions.<n>We demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
arXiv Detail & Related papers (2025-05-30T19:23:51Z) - Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses the limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval.<n>The innovative framework first create Semantic Partitioning of Linked Information, then use the Type-Specialized knowledge base to achieve Multi-Agent RAG.<n>The attribute-aware graph segmentation manages to divide knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types.<n>A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications.
arXiv Detail & Related papers (2025-05-20T06:44:34Z) - DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering [57.875992666888855]
Multi-Hop Question Answering (MHQA) tasks pose challenges in orchestrating multi-step reasoning across diverse knowledge domains.<n>We propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval.
arXiv Detail & Related papers (2025-04-25T10:43:53Z) - Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives.<n>We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z) - MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation [5.525151548786079]
Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations.<n>We propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries.
arXiv Detail & Related papers (2025-03-29T06:26:01Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.