Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality
- URL: http://arxiv.org/abs/2505.11140v2
- Date: Fri, 26 Sep 2025 09:25:25 GMT
- Title: Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality
- Authors: Mike Zhang, Johannes Bjerva, Russa Biswas,
- Abstract summary: We introduce fs1, a method that improves the factuality of reasoning traces by sourcing them from large reasoning models.<n>We fine-tune eight instruction-tuned Large Language Models (LLMs) on 3.9K factually grounded reasoning traces.<n>Our results demonstrate that our fs1-tuned model consistently outperforms instruction-tuned counterparts with parallel sampling by 6-14 absolute points.
- Score: 13.453581744429352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce fs1, a simple yet effective method that improves the factuality of reasoning traces by sourcing them from large reasoning models (e.g., DeepSeek-R1) and grounding them by conditioning on knowledge graph (KG) paths. We fine-tune eight instruction-tuned Large Language Models (LLMs) on 3.9K factually grounded reasoning traces and rigorously evaluate them on six complex open-domain question-answering (QA) benchmarks encompassing 23.9K questions. Our results demonstrate that our fs1-tuned model (32B parameters) consistently outperforms instruction-tuned counterparts with parallel sampling by 6-14 absolute points (pass@$16$). Our detailed analysis shows that fs1 considerably improves model performance over more complex questions (requiring 3 or more hops on KG paths) and numerical answer types compared to the baselines. Furthermore, in single-pass inference, we notice that smaller LLMs show the most improvements. While prior works demonstrate the effectiveness of reasoning traces primarily in the STEM domains, our work shows strong evidence that anchoring reasoning to factual KG paths is a critical step in transforming LLMs for reliable knowledge-intensive tasks.
Related papers
- RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering [5.044093798902163]
RPO-RAG is the first KG-based RAG framework specifically designed for small language models.<n>RPO-RAG introduces a query-path semantic sampling strategy that provides informative supervisory signals.<n>Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models.
arXiv Detail & Related papers (2026-01-27T05:46:32Z) - Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations.<n>How to check their outputs effectively and efficiently has become a critical problem in their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z) - From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs [58.02809208460186]
We revisit this paradox using high-quality traces from DeepSeek-R1 as demonstrations.<n>We find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal.<n>We introduce Insight-to-solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights.
arXiv Detail & Related papers (2025-09-27T08:59:31Z) - KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration [29.320693000484273]
KG-o1 is a four-stage approach that integrates knowledge graphs to enhance the multi-hop reasoning abilities of Large Language Models.<n>We conduct experiments on two simple and two complex datasets.<n>The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
arXiv Detail & Related papers (2025-08-12T04:29:10Z) - Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes [16.451488374845407]
We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs)<n>This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research.
arXiv Detail & Related papers (2025-07-25T10:34:51Z) - Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks.<n>We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation.<n>We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
arXiv Detail & Related papers (2025-07-14T17:55:15Z) - Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.<n>TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.<n>We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Interleaved Reasoning for Large Language Models via Reinforcement Learning [22.403928213802036]
Long chain-of-thought (CoT) enhances large language models' (LLM) reasoning capabilities.<n>We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions.
arXiv Detail & Related papers (2025-05-26T07:58:17Z) - Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens [51.90059610606049]
This paper revisits the efficiency of such reasoning processes through an information-theoretic lens.<n>We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution.<n>Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high.
arXiv Detail & Related papers (2025-05-23T13:38:56Z) - Phi-4-reasoning Technical Report [42.508165017775]
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks.<n>We develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning.<n>Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model.
arXiv Detail & Related papers (2025-04-30T05:05:09Z) - Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models [48.98109982725689]
We conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families.<n>Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths.<n>We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.<n>We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - LIMO: Less is More for Reasoning [23.312893016642096]
We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples.<n>Our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500.<n>LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks.
arXiv Detail & Related papers (2025-02-05T17:23:45Z) - The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks.<n>We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities.<n>Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models [73.4425450752596]
Chain-of-thought (CoT) prompting has impressively unlocked the reasoning potential of large language models (LLMs)
Yet, the standard CoT is less effective in problems demanding multiple reasoning steps.
We propose RESPROMPT, a new prompting strategy that advances multi-step reasoning in LLMs.
arXiv Detail & Related papers (2023-10-07T08:56:28Z) - Logic-Guided Data Augmentation and Regularization for Consistent
Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.