Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution
- URL: http://arxiv.org/abs/2410.21940v1
- Date: Tue, 29 Oct 2024 10:57:33 GMT
- Title: Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution
- Authors: Senne Deproost, Denis Steckelmacher, Ann Nowé
- Abstract summary: We build on TD3 and use its critics as the basis of the objective function of a genetic algorithm that synthesizes the program.
Our approach steers the program toward actual high rewards, instead of a simple Mean Squared Error.
- Score: 4.831084635928491
- License:
- Abstract: With Deep Reinforcement Learning (DRL) being increasingly considered for the control of real-world systems, the lack of transparency of the neural network at the core of RL becomes a concern. Programmatic Reinforcement Learning (PRL) is able to create representations of this black box in the form of source code, not only increasing the explainability of the controller but also allowing for user adaptations. However, these methods focus on distilling a black-box policy into a program, and they do so after learning, using the Mean Squared Error between produced and wanted behaviour, discarding other elements of the RL algorithm. The distilled policy may therefore perform significantly worse than the black-box learned policy. In this paper, we propose to directly learn a program as the policy of an RL agent. We build on TD3 and use its critics as the basis of the objective function of a genetic algorithm that synthesizes the program. Our approach builds the program during training, as opposed to after the fact. This steers the program toward actual high rewards, instead of a simple Mean Squared Error. Also, our approach leverages the TD3 critics to achieve high sample efficiency, as opposed to pure genetic methods that rely on Monte-Carlo evaluations. Our experiments demonstrate the validity, explainability and sample efficiency of our approach in a simple gridworld environment.
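To make the critic-moderated idea concrete, here is a minimal, self-contained sketch (not the paper's implementation): programs are tiny expression trees over the state, and a toy stand-in critic plays the role of TD3's learned Q-function, scoring each candidate program on states that would come from the replay buffer. All names and the toy critic are illustrative assumptions.

```python
import random

# Illustrative stand-ins: in the paper these would be TD3's trained critic
# and states sampled from the replay buffer; here they are toy placeholders.
def critic(state, action):
    # Toy Q-function: prefers actions close to -0.5 * state.
    return -(action + 0.5 * state) ** 2

replay_states = [random.uniform(-1.0, 1.0) for _ in range(64)]

# A "program" is a tiny expression tree over the state variable `s`.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_program(depth=2):
    if depth == 0 or random.random() < 0.3:
        return random.choice(["s", round(random.uniform(-1, 1), 2)])
    op = random.choice(list(OPS))
    return (op, random_program(depth - 1), random_program(depth - 1))

def run(program, s):
    if program == "s":
        return s
    if isinstance(program, (int, float)):
        return program
    op, left, right = program
    return OPS[op](run(left, s), run(right, s))

def fitness(program):
    # Critic-moderated objective: average Q-value of the program's actions,
    # instead of a mean-squared error to a black-box policy.
    return sum(critic(s, run(program, s)) for s in replay_states) / len(replay_states)

def mutate(program):
    # Replace the whole tree or one random subtree with a fresh random subtree.
    if not isinstance(program, tuple) or random.random() < 0.3:
        return random_program()
    op, left, right = program
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def evolve(generations=30, population_size=32):
    population = [random_program() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 4]
        population = parents + [mutate(random.choice(parents))
                                for _ in range(population_size - len(parents))]
    return max(population, key=fitness)

best = evolve()
print("best program:", best, "fitness:", round(fitness(best), 4))
```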
Related papers
- SHIRE: Enhancing Sample Efficiency using Human Intuition in REinforcement Learning [11.304750795377657]
We propose SHIRE, a framework for encoding human intuition using Probabilistic Graphical Models (PGMs).
SHIRE achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost.
arXiv Detail & Related papers (2024-09-16T04:46:22Z)
- Decomposing Control Lyapunov Functions for Efficient Reinforcement Learning [10.117626902557927]
Current Reinforcement Learning (RL) methods require large amounts of data to learn a specific task, leading to unreasonable costs when deploying the agent to collect data in real-world applications.
In this paper, we build from existing work that reshapes the reward function in RL by introducing a Control Lyapunov Function (CLF) to reduce the sample complexity.
We show that our method finds a policy to successfully land a quadcopter in less than half the amount of real-world data required by the state-of-the-art Soft-Actor Critic algorithm.
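A minimal sketch of the general CLF reward-shaping idea, assuming a hand-picked quadratic Lyapunov candidate for a 1-D toy system rather than the paper's decomposed CLF construction:

```python
import numpy as np

def clf(state):
    # Toy quadratic Lyapunov candidate V(s) = s^T s for a 1-D system;
    # the cited work constructs a CLF suited to the real dynamics.
    return float(np.dot(state, state))

def shaped_reward(reward, state, next_state, lam=1.0):
    # Add a shaping term that rewards decreasing the CLF value,
    # i.e. moving the system toward the equilibrium.
    return reward + lam * (clf(state) - clf(next_state))

# Usage on a toy transition: moving from s=1.0 toward 0 earns a bonus.
s, s_next, env_reward = np.array([1.0]), np.array([0.5]), 0.0
print(shaped_reward(env_reward, s, s_next))  # 0.75
```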
arXiv Detail & Related papers (2024-03-18T19:51:17Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
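A hedged sketch of the general recipe of using LLM guidance as a regularizer in a value-based update; `llm_action_probs` is a placeholder assumption, and this is not the LINVIT algorithm itself:

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 3
Q = np.zeros((N_STATES, N_ACTIONS))

def llm_action_probs(state):
    # Placeholder for asking an LLM which action looks sensible in this state;
    # here it always mildly prefers action 0.
    probs = np.full(N_ACTIONS, 0.2)
    probs[0] = 0.6
    return probs

def guided_td_update(s, a, r, s_next, alpha=0.1, gamma=0.99, beta=0.5):
    # Standard TD target plus a regularization bonus proportional to the
    # log-probability the LLM assigns to the chosen action, pulling the
    # learned values toward LLM-preferred behaviour and reducing the
    # amount of environment data needed.
    td_target = r + gamma * Q[s_next].max()
    guidance = beta * np.log(llm_action_probs(s)[a] + 1e-8)
    Q[s, a] += alpha * (td_target + guidance - Q[s, a])

# Toy usage: one transition in which the LLM-preferred action was taken.
guided_td_update(s=0, a=0, r=1.0, s_next=1)
print(Q[0])
```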
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- On Reducing Undesirable Behavior in Deep Reinforcement Learning Models [0.0]
We propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software.
Our framework can assist in providing engineers with a comprehensible characterization of such undesirable behavior.
arXiv Detail & Related papers (2023-09-06T09:47:36Z)
- Large Language Models can Implement Policy Iteration [18.424558160071808]
In-Context Policy Iteration is an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models.
ICPI learns to perform RL tasks without expert demonstrations or gradients.
ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment.
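A rough sketch of the in-context loop, assuming a toy environment and a stand-in `query_model` in place of a real foundation model; ICPI's actual prompt construction and policy-improvement step differ:

```python
import random

def query_model(prompt):
    # Stand-in for a foundation-model call that reads the transcript and
    # proposes an action; here it just guesses from the allowed actions.
    return random.choice(["left", "right"])

def step(state, action):
    # Toy one-dimensional environment: "right" moves toward the goal at 5.
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward, next_state == 5

transcript = []   # in-context "memory" that replaces gradient updates
state = 0
for _ in range(20):
    prompt = "\n".join(f"s={s} a={a} r={r}" for s, a, r in transcript)
    action = query_model(prompt + f"\ns={state} a=?")
    next_state, reward, done = step(state, action)
    transcript.append((state, action, reward))   # trial and error updates the prompt
    state = 0 if done else next_state

print(f"collected {len(transcript)} transitions in-context")
```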
arXiv Detail & Related papers (2022-10-07T21:18:22Z)
- Verifying Learning-Based Robotic Navigation Systems [61.01217374879221]
We show how modern verification engines can be used for effective model selection.
Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior.
Our work is the first to demonstrate the use of verification backends for recognizing suboptimal DRL policies in real-world robots.
arXiv Detail & Related papers (2022-05-26T17:56:43Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
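A minimal sketch of the two-policy idea, assuming the guide policy rolls in for the first `h` steps of each episode before the exploration policy takes over, with `h` shrinking over training; both policies below are trivial placeholders rather than JSRL's learned components:

```python
import random

def guide_policy(state):
    # Placeholder for a pre-existing policy (e.g. from offline data or demos).
    return "right"

def exploration_policy(state):
    # Placeholder for the policy actually being trained.
    return random.choice(["left", "right"])

def step(state, action):
    next_state = state + (1 if action == "right" else -1)
    return next_state, float(next_state >= 5), next_state >= 5

def run_episode(jump_start_horizon, max_steps=20):
    state, total_reward = 0, 0.0
    for t in range(max_steps):
        # Roll in with the guide for the first h steps, then hand over.
        policy = guide_policy if t < jump_start_horizon else exploration_policy
        state, reward, done = step(state, policy(state))
        total_reward += reward
        if done:
            break
    return total_reward

# Curriculum: shrink the jump-start horizon as training progresses.
for h in [10, 5, 2, 0]:
    returns = [run_episode(h) for _ in range(50)]
    print(f"h={h:2d} mean return={sum(returns) / len(returns):.2f}")
```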
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration [9.017416068706579]
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback.
We develop an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy.
We demonstrate the superior performance of our algorithm over state-of-the-art approaches.
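As a loose illustration (not the cited algorithm itself), one common way to exploit sub-optimal demonstrations under sparse rewards is to add a behaviour-cloning-style guidance term to the policy update; all values below are made up:

```python
import numpy as np

N_STATES, N_ACTIONS = 6, 2
policy_logits = np.zeros((N_STATES, N_ACTIONS))

# Assumed format for offline demonstrations from a sub-optimal behavior policy:
# a list of (state, action) pairs, invented here for illustration.
demos = [(0, 1), (1, 1), (2, 1), (3, 0), (4, 1)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_update(s, a, advantage, lr=0.1, guidance_weight=0.5):
    # Policy-gradient-style step on the (often zero) sparse-reward advantage,
    # plus a guidance term that raises the probability of the demonstrated
    # action in this state, so learning makes progress before rewards appear.
    probs = softmax(policy_logits[s])
    grad = -probs
    grad[a] += 1.0
    policy_logits[s] += lr * advantage * grad
    for ds, da in demos:
        if ds == s:
            demo_grad = -softmax(policy_logits[s])
            demo_grad[da] += 1.0
            policy_logits[s] += lr * guidance_weight * demo_grad

# Toy usage: the sparse reward gave zero advantage, but the demo still guides.
guided_update(s=1, a=0, advantage=0.0)
print(softmax(policy_logits[1]))  # probability shifted toward the demo action
```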
arXiv Detail & Related papers (2022-02-09T18:45:40Z)
- Robust Predictable Control [149.71263296079388]
We show that our method achieves much tighter compression than prior methods, yielding up to 5x higher reward than a standard information bottleneck.
We also demonstrate that our method learns policies that are more robust and generalize better to new tasks.
arXiv Detail & Related papers (2021-09-07T17:29:34Z)
- Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
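A hedged sketch of the re-weighting idea: each transition's Bellman update is weighted by an estimate of how inaccurate its bootstrapped target is. DisCor trains a separate model for this estimate; the toy array below is an assumption:

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2
Q = np.zeros((N_STATES, N_ACTIONS))

# Toy stand-in for DisCor's learned estimate of the target-value error
# at each next state; in DisCor this comes from a separately trained network.
target_error_estimate = np.array([0.1, 2.0, 0.5, 0.1])

def reweighted_update(batch, alpha=0.5, gamma=0.99, temperature=1.0):
    # Down-weight transitions whose bootstrapped targets are likely inaccurate,
    # instead of weighting every transition in the batch equally.
    next_states = [s2 for _, _, _, s2 in batch]
    weights = np.exp(-gamma * target_error_estimate[next_states] / temperature)
    weights /= weights.sum()
    for w, (s, a, r, s2) in zip(weights, batch):
        td_error = r + gamma * Q[s2].max() - Q[s, a]
        Q[s, a] += alpha * w * td_error

# Toy batch of (state, action, reward, next_state) transitions.
batch = [(0, 0, 1.0, 1), (2, 1, 1.0, 3), (1, 0, 1.0, 0)]
reweighted_update(batch)
print(np.round(Q, 3))
```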
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.