LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning
- URL: http://arxiv.org/abs/2508.18420v1
- Date: Mon, 25 Aug 2025 19:10:58 GMT
- Title: LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning
- Authors: André Quadros, Cassio Silva, Ronnie Alves
- Abstract summary: This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of learning agents in environments with extremely sparse rewards. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). Our empirical results show that this combined strategy significantly increases agent performance and efficiency compared to using each strategy individually.
- Score: 0.27528170226206433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of reinforcement learning (RL) agents in environments with extremely sparse rewards, where traditional learning struggles due to infrequent positive feedback. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). The LLMs leverage their pre-trained knowledge to generate reward signals based on environment and goal descriptions, guiding the agent. We implemented this combined approach with an Actor-Critic (A2C) agent in the MiniGrid DoorKey environment, a benchmark for sparse rewards. Our empirical results show that this combined strategy significantly increases agent performance and sample efficiency compared to using each strategy individually or a standard A2C agent, which failed to learn. Analysis of learning curves indicates that the combination effectively complements different aspects of the environment and task: VSIMR drives exploration of new states, while the LLM-derived rewards facilitate progressive exploitation towards goals.
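The abstract describes the mechanism at a high level; the sketch below shows one minimal way the two intrinsic reward channels could be combined. It assumes a small torch VAE over flattened MiniGrid observations; the `llm_goal_score` stub, the network sizes, and the `beta_vae`/`beta_llm` weights are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: combining a VSIMR-style VAE novelty bonus with an
# LLM-derived reward for an A2C agent, as the abstract describes.
# Network sizes, the `llm_goal_score` stub, and the beta weights are
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Small VAE over flattened MiniGrid observations."""
    def __init__(self, obs_dim: int, latent_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, obs_dim))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def vsimr_bonus(vae: VAE, obs: torch.Tensor) -> torch.Tensor:
    """Novelty bonus: states the VAE reconstructs poorly score high."""
    recon, mu, logvar = vae(obs)
    recon_err = F.mse_loss(recon, obs, reduction="none").sum(-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return (recon_err + kl).detach()  # negative ELBO as surprise

def llm_goal_score(state_text: str, goal_text: str) -> float:
    """Placeholder for the LLM channel: prompt a pretrained LLM with the
    environment and goal descriptions and parse a scalar reward. A real
    implementation would call an LLM API here."""
    raise NotImplementedError

def combined_reward(r_ext: float, obs: torch.Tensor, state_text: str,
                    goal_text: str, vae: VAE,
                    beta_vae: float = 0.01, beta_llm: float = 0.1) -> float:
    """obs: shape [1, obs_dim]. Sparse extrinsic reward plus the two
    intrinsic channels; the betas trade exploration off against
    LLM-guided exploitation."""
    return (r_ext
            + beta_vae * vsimr_bonus(vae, obs).item()
            + beta_llm * llm_goal_score(state_text, goal_text))
```

In DoorKey the extrinsic reward is zero until the goal is reached, so early training would be driven almost entirely by the two intrinsic terms; the abstract's learning-curve analysis corresponds to the VAE term dominating early exploration and the LLM term steering later exploitation.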
Related papers
- Expanding LLM Agent Boundaries with Strategy-Guided Exploration [51.98616048282804]
Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. We propose Strategy-Guided Exploration (SGE) to shift exploration from low-level actions to higher-level language strategies.
arXiv Detail & Related papers (2026-03-02T16:28:39Z)
- Reinforcement World Model Learning for LLM-based Agents [60.65003139516272]
Reinforcement World Model Learning (RWML) is a self-conditioned method that learns action-supervised world models for LLM-based agents. Our method aligns simulated next states produced by the model with realized next states observed from the environment. We evaluate our method on ALFWorld and $2$ Bench and observe significant gains over the base model, despite being entirely self-supervised.
arXiv Detail & Related papers (2026-02-05T16:30:08Z)
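The RWML summary describes aligning simulated next states with realized ones; below is a hedged sketch of one plausible alignment score, where the caller-supplied `embed` function and cosine scoring are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the alignment idea in the RWML summary: score the
# world model by comparing its simulated next state against the next
# state the environment actually produced. The `embed` function and
# cosine scoring are illustrative assumptions, not the paper's code.
from typing import Callable, Sequence

def alignment_score(embed: Callable[[str], Sequence[float]],
                    simulated_next_state: str,
                    realized_next_state: str) -> float:
    """Cosine similarity between embedded simulated and realized states."""
    a, b = embed(simulated_next_state), embed(realized_next_state)
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-8)
```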
- MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward [14.959716217301368]
This paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for multi-agent reinforcement learning. MIR incentivizes individual agents to explore actions that affect their teammates and, when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. Our evaluation compares the proposed method against state-of-the-art approaches in the MiniGrid-MA setting, with experimental results demonstrating superior performance.
arXiv Detail & Related papers (2025-11-21T11:32:28Z)
- Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents [28.145430029174577]
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments. Existing approaches typically rely on outcome-based rewards that are only provided at the final answer. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training.
arXiv Detail & Related papers (2025-10-16T17:59:32Z)
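IGPO's dense supervision can be pictured as rewarding each turn by the information it adds toward the final answer; the sketch below is a hedged illustration of that idea, where `prob_correct` is an assumed caller-supplied estimator, not the paper's API.

```python
# Hedged sketch of an information-gain-style turn reward in the spirit
# of IGPO's dense supervision: each turn is rewarded by how much it
# raises the policy's probability of the ground-truth answer. The
# `prob_correct` estimator is an assumed caller-supplied function,
# not the paper's API.
from typing import Callable, List

def information_gain_rewards(
    prob_correct: Callable[[List[str]], float],
    turns: List[str],
) -> List[float]:
    """prob_correct(history) estimates the model's probability of the
    ground-truth answer given the interaction history so far."""
    rewards: List[float] = []
    history: List[str] = []
    prev = prob_correct(history)       # belief before any interaction
    for turn in turns:
        history.append(turn)
        cur = prob_correct(history)    # belief after this turn
        rewards.append(cur - prev)     # dense reward: marginal gain
        prev = cur
    return rewards
```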
- Continuous-Time Reinforcement Learning for Asset-Liability Management [0.0]
This paper proposes a novel approach for Asset-Liability Management (ALM) by employing continuous-time Reinforcement Learning (RL). We develop a model-free, policy gradient-based soft actor-critic algorithm tailored to ALM for dynamically synchronizing assets and liabilities. Our empirical study evaluates this approach against two enhanced traditional financial strategies, a model-based continuous-time RL method, and three state-of-the-art RL algorithms.
arXiv Detail & Related papers (2025-09-27T12:36:51Z)
- Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning [15.034714081414691]
We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards. The lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. We propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization.
arXiv Detail & Related papers (2025-09-26T03:41:40Z)
- Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from the reasoning processes of Large Language Models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed execution. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
- Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment [4.406086834602686]
We show how to reformulate credit assignment as the two pattern-recognition problems of sequence improvement and attribution. Our approach utilizes a centralized reward-critic which numerically decomposes the environment reward based on the individual contribution of each agent. Both of our methods far outperform the state of the art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark, which incorporates collision-related safety constraints.
arXiv Detail & Related papers (2025-02-24T05:56:47Z)
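The summary describes a centralized reward-critic that numerically decomposes the team reward into per-agent contributions; below is a minimal sketch of one such decomposition, where the network shape and softmax normalization are illustrative assumptions.

```python
# Hedged sketch of a centralized reward-critic that decomposes the team
# reward into per-agent contributions, as the summary describes. The
# network shape and softmax normalization are illustrative assumptions.
import torch
import torch.nn as nn

class RewardCritic(nn.Module):
    def __init__(self, obs_dim: int, n_agents: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * n_agents, 64), nn.ReLU(),
            nn.Linear(64, n_agents))

    def forward(self, joint_obs: torch.Tensor,
                team_reward: torch.Tensor) -> torch.Tensor:
        # Softmax weights guarantee the per-agent rewards sum back to
        # the team reward, one simple way to keep the split exact.
        weights = torch.softmax(self.net(joint_obs), dim=-1)
        return weights * team_reward.unsqueeze(-1)
```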
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise rewards to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning [6.691759477350243]
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences.
arXiv Detail & Related papers (2024-10-16T12:14:25Z)
- Efficient Reinforcement Learning via Decoupling Exploration and Utilization [6.305976803910899]
Reinforcement Learning (RL) has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles.
In this work, we aim to train agents efficiently by decoupling exploration and utilization, so that the agent can escape the conundrum of suboptimal solutions.
This idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm.
arXiv Detail & Related papers (2023-12-26T09:03:23Z)
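The OPARL summary names the decoupling but not its form; one common way to realize optimistic exploration alongside pessimistic utilization is with bounds over a critic ensemble, sketched below under that assumption (the ensemble statistics and `kappa` weight are illustrative, not the paper's design).

```python
# Hedged sketch: "optimistic exploration, pessimistic utilization" via
# upper and lower bounds over a critic ensemble. This is an assumption
# about OPARL's design, not a reproduction of it; `kappa` is an
# illustrative bonus weight.
import torch

def ensemble_bounds(q_values: torch.Tensor, kappa: float = 1.0):
    """q_values: [ensemble, batch] critic estimates for one action.
    Returns targets for the exploration and exploitation actors."""
    mean, std = q_values.mean(0), q_values.std(0)
    optimistic = mean + kappa * std   # exploration actor maximizes this
    pessimistic = mean - kappa * std  # exploitation actor maximizes this
    return optimistic, pessimistic
```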
- Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA).
SAMA prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning.
SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
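The SAMA entry describes prompting a pretrained LLM for goal decomposition and subgoal allocation; below is a hedged sketch of what such a prompt-and-parse step could look like. The prompt wording and both helper functions are assumptions for illustration, not the paper's prompts.

```python
# Hedged sketch of SAMA-style goal decomposition prompting: a pretrained
# LLM is asked, with chain-of-thought, to propose subgoals and assign
# them to agents. The prompt wording and the parsing helper are
# illustrative assumptions, not the paper's actual prompts.
from typing import Dict, List

def build_decomposition_prompt(task: str, agents: List[str]) -> str:
    return (
        f"Task: {task}\n"
        f"Agents: {', '.join(agents)}\n"
        "Think step by step. First suggest a potential overall goal, "
        "then decompose it into subgoals, and finally allocate one "
        "subgoal to each agent as 'agent: subgoal' lines."
    )

def allocate_subgoals(llm_reply: str) -> Dict[str, str]:
    """Parse 'agent: subgoal' lines from the LLM's reply."""
    allocation: Dict[str, str] = {}
    for line in llm_reply.splitlines():
        if ":" in line:
            agent, subgoal = line.split(":", 1)
            allocation[agent.strip()] = subgoal.strip()
    return allocation
```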
- Weakly Supervised Disentangled Representation for Goal-conditioned Reinforcement Learning [15.698612710580447]
We propose a skill-learning framework, DR-GRL, which aims to improve sample efficiency and policy generalization.
In a weakly supervised manner, we propose a Spatial Transform AutoEncoder (STAE) to learn an interpretable and controllable representation.
We empirically demonstrate that DR-GRL significantly outperforms the previous methods in sample efficiency and policy generalization.
arXiv Detail & Related papers (2022-02-28T09:05:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.