Related papers: OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models

URL: http://arxiv.org/abs/2410.23703v1
Date: Thu, 31 Oct 2024 07:48:44 GMT
Title: OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models
Authors: Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, Julian McAuley
Abstract summary: This work focuses on the offline evaluation of the chain-of-thought capabilities of LLMs. We use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. We show how to optimize LLMs based on the proposed evaluation method.
Score: 68.17018458283651
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Offline evaluation of LLMs is crucial in understanding their capacities, though current methods remain underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluates the policy's alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose the KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge.
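
Illustration: the abstract builds on inverse propensity scoring (IPS) for off-policy evaluation. The sketch below shows only the generic, textbook trajectory-level IPS estimator, not the paper's KG-IPS estimator, whose exact form is not given on this page. The function name `ips_estimate`, its arguments, and the idea of passing per-token log-probabilities as plain lists are all assumptions made for illustration.

```python
import math
from typing import Sequence


def ips_estimate(
    target_logps: Sequence[Sequence[float]],
    behavior_logps: Sequence[Sequence[float]],
    rewards: Sequence[float],
) -> float:
    """Generic trajectory-level IPS estimate of a target policy's value.

    Hypothetical inputs: for each logged trajectory, the per-token
    log-probabilities under the target policy and under the behavior
    (logging) policy, plus one scalar reward for the whole trajectory.
    """
    if not (len(target_logps) == len(behavior_logps) == len(rewards)):
        raise ValueError("all inputs must describe the same number of trajectories")
    if len(rewards) == 0:
        raise ValueError("at least one trajectory is required")

    total = 0.0
    for t_lp, b_lp, reward in zip(target_logps, behavior_logps, rewards):
        if len(t_lp) != len(b_lp):
            raise ValueError("per-token log-probabilities must have equal length")
        # Importance weight = product of per-token probability ratios,
        # computed in log space for numerical stability.
        log_weight = sum(t_lp) - sum(b_lp)
        total += math.exp(log_weight) * reward
    return total / len(rewards)
```

When the target and behavior policies assign identical probabilities, every importance weight is 1 and the estimate reduces to the mean logged reward. Importance weights formed as products over many tokens can have high variance, which is one reason estimators of this family are usually analyzed for both bias and variance.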

Related papers

Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
The Hidden Link Between RLHF and Contrastive Learning [24.828596020853727]
We show that Reinforcement Learning from Human Feedback and Direct Preference Optimization can be interpreted from the perspective of mutual information. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization.
arXiv Detail & Related papers (2025-06-27T18:51:25Z)
Hybrid Latent Reasoning via Reinforcement Learning [51.06635386903026]
We explore latent reasoning by leveraging the capabilities of large language models (LLMs) via reinforcement learning (RL). We introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that integrates prior hidden states into sampled tokens with a learnable gating mechanism. HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths.
arXiv Detail & Related papers (2025-05-24T01:26:16Z)
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z)
Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM [11.181783720439563]
Large Language Models (LLMs) display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. RLMs often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting. We introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs.
arXiv Detail & Related papers (2025-05-20T03:54:57Z)
MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model [54.14155564592936]
We propose a Mixture of Rule Experts guided by a Large Language Model (MoRE-LLM). MoRE-LLM steers the discovery of local rule-based surrogates during training and their utilization for the classification task. The LLM is responsible for enhancing the domain knowledge alignment of the rules by correcting and contextualizing them.
arXiv Detail & Related papers (2025-03-26T11:09:21Z)
Online Preference Alignment for Language Models via Count-based Exploration [46.46627519343809]
Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage. Online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs.
arXiv Detail & Related papers (2025-01-22T09:12:09Z)
SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts [0.6291443816903801]
This paper introduces a novel framework designed to autonomously evaluate the robustness of large language models (LLMs). Our method generates descriptive sentences from domain-constrained knowledge graph triplets to formulate adversarial prompts. This self-evaluation mechanism allows the LLM to evaluate its robustness without the need for external benchmarks.
arXiv Detail & Related papers (2024-12-01T10:58:53Z)
Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation [9.844598565914055]
Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. We introduce SubgraphRAG, extending the Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) framework that retrieves subgraphs. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval.
arXiv Detail & Related papers (2024-10-28T04:39:32Z)
Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains [66.55612528039894]
Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA). We present DoG (Decoding on Graphs), a novel framework that facilitates a deep synergy between LLMs and KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance.
arXiv Detail & Related papers (2024-10-24T04:01:40Z)
GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning framework that integrates the parametric and non-parametric memories. Our method facilitates a more logical and step-wise reasoning approach akin to experts' problem-solving, rather than gold answer retrieval.
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering [35.2451096137883]
We introduce OKGQA, a new benchmark specifically designed to assess Large Language Models (LLMs) enhanced with Knowledge Graphs (KGs). OKGQA is designed to closely reflect the complexities of practical applications using questions from different types, and incorporates specific metrics to measure both the reduction in hallucinations and the enhancement in reasoning capabilities. We also propose OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated.
arXiv Detail & Related papers (2024-10-10T16:29:21Z)
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates. We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
Balancing Exploration and Exploitation in LLM using Soft RLLF for Enhanced Negation Understanding [4.799288023353623]
Finetuning approaches in NLP often focus on exploitation rather than exploration, which may lead to suboptimal models. We leverage Reinforcement Learning from Logical Feedback to create an effective balance between exploration and exploitation in language models. This has implications for the development of more accurate, reliable, and logically consistent language models in high-stakes domains.
arXiv Detail & Related papers (2024-03-02T11:54:55Z)
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers. We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting [51.7049140329611]
This paper proposes Knowledge Graph-based Retrofitting (KGR) to mitigate factual hallucination during the reasoning process. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks.
arXiv Detail & Related papers (2023-11-22T11:08:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.