Related papers: Enhancing Q-Learning with Large Language Model Heuristics

Enhancing Q-Learning with Large Language Model Heuristics

URL: http://arxiv.org/abs/2405.03341v3
Date: Fri, 24 May 2024 06:32:06 GMT
Title: Enhancing Q-Learning with Large Language Model Heuristics
Authors: Xiefeng Wu,
Abstract summary: Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. We propose textbfLLM-guided Q-learning, a framework that leverages LLMs as hallucinations to aid in learning the Q-function for reinforcement learning.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose \textbf{LLM-guided Q-learning}, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.

Related papers

GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs)<n>This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework.<n>GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
arXiv Detail & Related papers (2025-07-14T08:10:00Z)
MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models [95.6332110724999]
Motivation-enhanced Reinforcement Finetuning (MeRF) is an intuitive yet effective method enhancing reinforcement learning of Large Language Models (LLMs)<n>MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for model to improve its responses with awareness of the optimization objective.<n> Empirical evaluations on the Knights and Knaves(K&K) logic puzzle reasoning benchmark demonstrate that textttMeRF achieves substantial performance gains over baselines.
arXiv Detail & Related papers (2025-06-23T10:37:57Z)
Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost.<n>We propose two lightweight methods to enhance LRM efficiency.<n>First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction.<n>Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z)
ICPL: Few-shot In-context Preference Learning via LLMs [15.84585737510038]
We show that Large Language Models (LLMs) have native preference-learning capabilities that allow them to achieve sample-efficient preference learning. We propose In-Context Preference Learning (ICPL), which uses in-context learning capabilities of LLMs to reduce human query inefficiency.
arXiv Detail & Related papers (2024-10-22T17:53:34Z)
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL [7.988692259455583]
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences.
arXiv Detail & Related papers (2024-10-16T12:14:25Z)
LLMs Are In-Context Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL) This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation.
arXiv Detail & Related papers (2024-10-07T17:45:00Z)
BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique, inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs) The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM. In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach.
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
Predicting and Understanding Human Action Decisions: Insights from Large Language Models and Cognitive Instance-Based Learning [0.0]
Large Language Models (LLMs) have demonstrated their capabilities across various tasks. This paper exploits the reasoning and generative capabilities of the LLMs to predict human behavior in two sequential decision-making tasks. We compare the performance of LLMs with a cognitive instance-based learning model, which imitates human experiential decision-making.
arXiv Detail & Related papers (2024-07-12T14:13:06Z)
Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language models (LLMs) unlearning via gradient ascent (GA) Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning. We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models. Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation. To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
Post Hoc Explanations of Language Models Can Improve Language Models [43.2109029463221]
We present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY) We leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. Our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks.
arXiv Detail & Related papers (2023-05-19T04:46:04Z)
Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior. This paper addresses the problem of IRL -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations [78.94386823185724]
Imitation learning learns effectively in sparse-rewarded tasks by leveraging the existing expert demonstrations. In practice, collecting a sufficient amount of expert demonstrations can be prohibitively expensive. We propose Self-Adaptive Learning (SAIL) that can achieve (near) optimal performance given only a limited number of sub-optimal demonstrations.
arXiv Detail & Related papers (2020-04-01T15:57:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.