Evaluating and Enhancing Large Language Models for Conversational
Reasoning on Knowledge Graphs
- URL: http://arxiv.org/abs/2312.11282v2
- Date: Sun, 4 Feb 2024 03:45:04 GMT
- Title: Evaluating and Enhancing Large Language Models for Conversational
Reasoning on Knowledge Graphs
- Authors: Yuxuan Huang, Lida Shi, Anqi Liu and Hao Xu
- Abstract summary: We evaluate the conversational reasoning capabilities of the current state-of-the-art large language model (GPT-4) on knowledge graphs (KGs).
We introduce LLM-ARK, a grounded KG reasoning agent designed to deliver precise and adaptable predictions on KG paths.
LLaMA-2-7B-ARK outperforms the current state-of-the-art model by 5.28 percentage points, reaching 36.39% on the target@1 evaluation metric.
- Score: 15.480976967871632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of large language models (LLMs) has been catalyzed by
advances in pre-training techniques, and these models have demonstrated robust
reasoning capabilities through manually designed prompts. In this work, we
evaluate the conversational reasoning capabilities of the current
state-of-the-art LLM (GPT-4) on knowledge graphs (KGs). We find that LLM
performance is constrained by a lack of awareness of the KG environment and by
the difficulty of optimizing intermediate reasoning steps. We therefore
introduce LLM-ARK, an LLM-grounded KG reasoning agent designed to deliver
precise and adaptable predictions of KG paths. LLM-ARK uses a Full Textual
Environment (FTE) prompt to assimilate state information at each reasoning
step, and reframes multi-hop reasoning over the KG as a sequential
decision-making task. The model is optimized with the Proximal Policy
Optimization (PPO) online policy-gradient reinforcement learning algorithm,
learning from rich reward signals. We evaluate our model and GPT-4 on the
OpenDialKG dataset. LLaMA-2-7B-ARK outperforms the previous state-of-the-art
model by 5.28 percentage points, reaching 36.39% on the target@1 metric, while
GPT-4 scores 14.91%, further demonstrating the effectiveness of our method. Our
code is available at https://github.com/Aipura/LLM-ARK.
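The abstract's framing of multi-hop KG reasoning as a sequential decision process trained with PPO can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the hash-based `TinyPolicy` standing in for an LLM, the `build_fte_prompt` helper, and the single-transition PPO update are all assumptions made for illustration.

```python
# Hypothetical sketch (not the paper's code): multi-hop KG reasoning framed as a
# sequential decision process and trained with a PPO-style clipped objective.
import torch
import torch.nn as nn

# Toy KG: entity -> list of (relation, neighbor) edges.
KG = {
    "Inception": [("directed_by", "Christopher Nolan"), ("genre", "Sci-Fi")],
    "Christopher Nolan": [("directed", "Interstellar"), ("born_in", "London")],
}

def build_fte_prompt(dialogue, entity, path):
    """Full-Textual-Environment-style state: dialogue + current entity + path so far."""
    return f"Dialogue: {dialogue}\nCurrent entity: {entity}\nPath: {' -> '.join(path) or '(start)'}"

class TinyPolicy(nn.Module):
    """Scores candidate outgoing edges given a textual state (stand-in for an LLM policy)."""
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(10_000, dim)  # hashed bag-of-words encoder
        self.score = nn.Bilinear(dim, dim, 1)

    def encode(self, text):
        ids = torch.tensor([hash(w) % 10_000 for w in text.lower().split()])
        return self.embed(ids.unsqueeze(0))

    def forward(self, state_text, candidates):
        s = self.encode(state_text)
        c = torch.cat([self.encode(f"{r} {e}") for r, e in candidates])
        return self.score(s.expand_as(c), c).squeeze(-1)  # one logit per candidate edge

def ppo_loss(new_logp, old_logp, advantage, clip=0.2):
    """Standard PPO clipped surrogate objective (negated for gradient descent)."""
    ratio = torch.exp(new_logp - old_logp)
    return -torch.min(ratio * advantage, torch.clamp(ratio, 1 - clip, 1 + clip) * advantage)

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One rollout step: pick an edge from the current entity, reward 1 if we reach the target.
dialogue, entity, target, path = "Who directed Inception?", "Inception", "Christopher Nolan", []
candidates = KG[entity]
logits = policy(build_fte_prompt(dialogue, entity, path), candidates)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
old_logp = dist.log_prob(action).detach()
reward = 1.0 if candidates[action.item()][1] == target else 0.0
advantage = torch.tensor(reward)  # no value baseline in this toy example

# PPO update on the stored transition.
new_logits = policy(build_fte_prompt(dialogue, entity, path), candidates)
new_logp = torch.distributions.Categorical(logits=new_logits).log_prob(action)
loss = ppo_loss(new_logp, old_logp, advantage)
opt.zero_grad(); loss.backward(); opt.step()
```

A real agent would roll out full multi-hop episodes, use a learned value baseline, and run several PPO epochs per batch of trajectories; this toy keeps a single transition only to show the clipped objective.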
Related papers
- OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models [68.17018458283651]
This work focuses on the offline evaluation of the chain-of-thought capabilities of LLMs.
We use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts.
We show how to optimize LLMs based on the proposed evaluation method.
arXiv Detail & Related papers (2024-10-31T07:48:44Z)
- Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning [19.442426875488675]
We propose Paths-over-Graph (PoG), a novel method that enhances Large Language Models (LLMs) reasoning by integrating knowledge reasoning paths from KGs.
PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration.
In experiments, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.
arXiv Detail & Related papers (2024-10-18T06:57:19Z)
- Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering [35.2451096137883]
We introduce OKGQA, a new benchmark specifically designed to assess Large Language Models (LLMs) enhanced with Knowledge Graphs (KGs).
OKGQA is designed to closely reflect the complexities of practical applications, using questions of different types, and incorporates specific metrics to measure both the reduction in hallucinations and the enhancement in reasoning capabilities.
We also propose OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated.
arXiv Detail & Related papers (2024-10-10T16:29:21Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data (a minimal sketch of the step-level DPO objective appears after this list).
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph [134.8631016845467]
We propose an autonomous LLM-based agent framework, called KG-Agent.
In KG-Agent, we integrate the LLM, multifunctional toolbox, KG-based executor, and knowledge memory.
To guarantee effectiveness, we leverage a programming language to formulate the multi-hop reasoning process over the KG.
arXiv Detail & Related papers (2024-02-17T02:07:49Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- GLoRE: Evaluating Logical Reasoning of Large Language Models [29.914546407784552]
We introduce GLoRE, a benchmark comprised of 12 datasets that span three different types of tasks.
ChatGPT and GPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing ChatGPT by a large margin.
We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuned method to boost the performance of an open LLM.
arXiv Detail & Related papers (2023-10-13T13:52:15Z)
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond [29.778018058541676]
GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
arXiv Detail & Related papers (2023-09-28T16:43:35Z)
- Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph [29.447300472617826]
Think-on-Graph (ToG) is a new approach that enables large language models (LLMs) to reason over external knowledge graphs (KGs).
ToG iteratively executes beam search on the KG, discovering the most promising reasoning paths and returning the most likely reasoning results (a minimal sketch of this beam search appears after this list).
ToG achieves overall SOTA on 6 out of 9 datasets, where most previous SOTAs rely on additional training.
arXiv Detail & Related papers (2023-07-15T03:31:38Z)
- Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks [90.11273439036455]
Large Language Models (LLMs) have shown promising performance in knowledge-intensive reasoning tasks.
We propose Knowledge-Augmented Reasoning Distillation (KARD), a novel method that fine-tunes small LMs to generate rationales from LLMs with augmented knowledge retrieved from an external knowledge base.
We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets.
arXiv Detail & Related papers (2023-05-28T13:00:00Z)
- LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities [66.36633042421387]
This work evaluates Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning.
We propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning.
arXiv Detail & Related papers (2023-05-22T15:56:44Z)
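As noted in the Monte Carlo Tree Search entry above, that work collects step-level preference pairs with MCTS and updates the LLM policy with DPO. The following is a minimal, hypothetical sketch of the step-level DPO objective only; the `dpo_loss` helper, the log-probability values, and the choice of `beta` are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch: the DPO objective applied to one (chosen, rejected) pair of
# reasoning steps, e.g. ranked by an MCTS look-ahead search.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair.

    Each argument is the summed token log-probability of a candidate reasoning step
    under the current policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Toy numbers: the policy already slightly prefers the step that MCTS ranked higher.
loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-13.0), torch.tensor(-14.2))
print(loss)  # scalar loss to backpropagate through the policy log-probabilities
```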
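Similarly, for the Think-on-Graph entry above, here is a minimal, hypothetical sketch of beam search over a KG in which an LLM would score candidate reasoning paths. The toy graph, the keyword-overlap `score_path` stub (standing in for an LLM call), and the parameter values are assumptions for illustration.

```python
# Hypothetical ToG-style beam search: keep a beam of partial KG paths and let a scoring
# function (an LLM prompt in the real method; a stub here) decide which expansions to keep.
TOY_KG = {
    "Inception": [("directed_by", "Christopher Nolan"), ("genre", "Sci-Fi")],
    "Christopher Nolan": [("born_in", "London"), ("directed", "Interstellar")],
    "Sci-Fi": [("example", "Blade Runner")],
}

def score_path(question, path):
    # Stand-in for an LLM call that rates how promising a partial path is for the question.
    return sum(1 for _, entity in path if entity.lower() in question.lower())

def beam_search(question, start_entity, kg, beam_width=2, max_hops=2):
    beams = [[]]  # each beam is a list of (relation, entity) hops from start_entity
    for _ in range(max_hops):
        candidates = []
        for path in beams:
            frontier = path[-1][1] if path else start_entity
            for rel, ent in kg.get(frontier, []):
                candidates.append(path + [(rel, ent)])
        if not candidates:
            break
        # Keep only the beam_width most promising partial paths at this hop.
        beams = sorted(candidates, key=lambda p: score_path(question, p), reverse=True)[:beam_width]
    return beams

print(beam_search("Where was the director of Inception born?", "Inception", TOY_KG))
```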