Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
- URL: http://arxiv.org/abs/2601.07054v1
- Date: Sun, 11 Jan 2026 20:24:25 GMT
- Title: Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
- Authors: Zhuoyi Yang, Yurun Song, Iftekhar Ahmed, Ian Harris
- Abstract summary: We compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning, supervised fine-tuning, and retrieval-augmented generation. Retrieval-augmented generation yields substantial and consistent improvements when answering questions that rely on temporally novel information.
- Score: 7.716590111773082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as fine-tuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
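To make the comparison concrete, the sketch below shows the two inference-time conditions the paper contrasts: parametric (the model answers from its weights alone, as after fine-tuning) versus non-parametric (retrieved passages are injected into the prompt, as in RAG). The paper does not specify its prompts, retriever, or exact models; the Hugging Face-style API, the placeholder model name, the toy keyword-overlap retriever, and the prompt format here are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: parametric vs. non-parametric knowledge injection at inference time.
# Assumptions (not from the paper): a Hugging Face-style causal LM, a toy
# keyword-overlap retriever, and an illustrative prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder for "a 7B open-source LLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Toy corpus standing in for the 2024 Wikipedia event passages.
corpus = [
    "In 2024, Event X took place in City Y.",
    "City Y is the capital of Country Z.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Toy retriever: rank passages by keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def answer(question: str, use_rag: bool) -> str:
    if use_rag:
        # Non-parametric: supporting passages are placed in the prompt.
        context = "\n".join(retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    else:
        # Parametric: the (fine-tuned) model must answer from its weights alone.
        prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# A two-hop question whose supporting facts postdate the pretraining cutoff.
print(answer("In which country's capital did Event X take place in 2024?", use_rag=True))
```

Under this framing, the paper's finding is that the prompt-injection path (`use_rag=True`) helps most on post-cutoff questions, while the weights-only path improves substantially only through supervised fine-tuning, not continual pretraining.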
Related papers
- Query-Specific GNN: A Comprehensive Graph Representation Learning Method for Retrieval Augmented Generation [23.133432599408327]
Multi-hop questions require the identification of multiple knowledge targets to form a synthesized answer. Existing methods often struggle to fully understand questions with complex semantic structures. We propose a novel graph representation learning framework for multi-hop question retrieval.
arXiv Detail & Related papers (2025-10-13T15:41:15Z)
- Omne-R1: Learning to Reason with Memory for Multi-hop Question Answering [23.78587569108481]
Omne-R1 is a novel approach designed to enhance multi-hop question answering capabilities on schema-free knowledge graphs. Our method employs a multi-stage training workflow, including two reinforcement learning phases and one supervised fine-tuning phase.
arXiv Detail & Related papers (2025-08-24T12:36:48Z)
- MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge [29.040298045598163]
MINTQA is a benchmark for evaluating large language models' capabilities in multi-hop reasoning. It comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries.
arXiv Detail & Related papers (2024-12-22T14:17:12Z)
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [92.5712549836791]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance, but they still face challenges.
Previous instruction tuning methods force the model to complete a sentence regardless of whether it knows the relevant knowledge.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
In these situations, the model fails to distinguish the knowledge that is necessary to answer the question.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
arXiv Detail & Related papers (2023-05-10T17:54:51Z)
- Multi-hop Commonsense Knowledge Injection Framework for Zero-Shot Commonsense Question Answering [6.086719709100659]
We propose a novel multi-hop commonsense knowledge injection framework.
Our framework achieves state-of-the-art performance on five commonsense question answering benchmarks.
arXiv Detail & Related papers (2023-05-10T07:13:47Z)
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
- Reinforced Multi-task Approach for Multi-hop Question Generation [47.15108724294234]
We address multi-hop question generation, which aims to generate relevant questions based on supporting facts in the context.
We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator.
We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA.
arXiv Detail & Related papers (2020-04-05T10:16:59Z)