CHIRPs: Change-Induced Regret Proxy metrics for Lifelong Reinforcement Learning
- URL: http://arxiv.org/abs/2409.03577v2
- Date: Thu, 30 Jan 2025 12:44:14 GMT
- Title: CHIRPs: Change-Induced Regret Proxy metrics for Lifelong Reinforcement Learning
- Authors: John Birkbeck, Adam Sobey, Federico Cerutti, Katherine Heseltine Hurley Flynn, Timothy J. Norman,
- Abstract summary: Reinforcement learning (RL) agents are costly to train and fragile to environmental changes.<n>No prior work has established whether the impact on agent performance can be predicted from the change itself.<n>We propose Change-Induced Regret Proxy (CHIRP) metrics to link change to agent performance drops.
- Score: 5.825410941577592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) agents are costly to train and fragile to environmental changes. They often perform poorly when there are many changing tasks, prohibiting their widespread deployment in the real world. Many Lifelong RL agent designs have been proposed to mitigate issues such as catastrophic forgetting or demonstrate positive characteristics like forward transfer when change occurs. However, no prior work has established whether the impact on agent performance can be predicted from the change itself. Understanding this relationship will help agents proactively mitigate a change's impact for improved learning performance. We propose Change-Induced Regret Proxy (CHIRP) metrics to link change to agent performance drops and use two environments to demonstrate a CHIRP's utility in lifelong learning. A simple CHIRP-based agent achieved $48\%$ higher performance than the next best method in one benchmark and attained the best success rates in 8 of 10 tasks in a second benchmark which proved difficult for existing lifelong RL agents.
Related papers
- Reward-Conditioned Reinforcement Learning [56.417273471201845]
We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications.<n>RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from a shared replay data entirely off-policy.<n>Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
arXiv Detail & Related papers (2026-03-05T11:29:17Z) - Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents [54.18201810286764]
Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering.<n>In long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance.<n>We propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior.
arXiv Detail & Related papers (2026-02-02T12:52:14Z) - Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates.<n>JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly.<n>Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z) - AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI [5.165179548592513]
AgentChangeBench is a benchmark designed to measure how tool augmented language model agents adapt to mid dialogue goal shifts.<n>Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and GoalShift Recovery Time (GSRT) for adaptation.
arXiv Detail & Related papers (2025-10-20T23:48:07Z) - Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking [13.417125511014447]
We propose an automated framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences.<n>PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches.
arXiv Detail & Related papers (2025-10-14T23:18:24Z) - Reinforcement Learning for Machine Learning Engineering Agents [52.03168614623642]
We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger, but static models.<n>We propose duration- aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions.<n>We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
arXiv Detail & Related papers (2025-09-01T18:04:10Z) - Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments [1.2787026473187368]
Multi-agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors.<n>Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt.<n>This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non-stationary environment.
arXiv Detail & Related papers (2025-07-18T11:59:55Z) - JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning [6.81021875668872]
We propose JoyAgents-R1, which first applies Group Relative Policy Optimization to the joint training of heterogeneous multi-agents.<n>We show that JoyAgents-R1 achieves performance comparable to that of larger LLMs while built on smaller open-source models.
arXiv Detail & Related papers (2025-06-24T17:59:31Z) - RRO: LLM Agent Optimization Through Rising Reward Trajectories [52.579992804584464]
Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks.<n>In practice, agents sensitive to the outcome of certain key steps which makes them likely to fail the task.<n>We propose Reward Rising Optimization (RRO) to mitigate this issue.
arXiv Detail & Related papers (2025-05-27T05:27:54Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - $TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning [7.97295726921338]
Temporal-Agent Reward Redistribution $TAR2$ is a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components.
We show that $TAR2$ aligns with potential-based reward shaping, preserving the same optimal policies as the original environment.
arXiv Detail & Related papers (2025-02-07T12:07:57Z) - No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z) - On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents.<n>The impact of clumsy or even malicious agents--those who frequently make errors in their tasks--on the overall performance of the system remains underexplored.<n>This paper investigates what is the resilience of various system structures under faulty agents on different downstream tasks.
arXiv Detail & Related papers (2024-08-02T03:25:20Z) - CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms [5.331052581441265]
We develop a novel algorithm, Corrected Uniform Experience (CUER), which samples the stored experience while considering the fairness among all other experiences.
CUER provides promising improvements for off-policy continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during the training.
arXiv Detail & Related papers (2024-06-13T12:03:40Z) - A State Augmentation based approach to Reinforcement Learning from Human
Preferences [20.13307800821161]
Preference Based Reinforcement Learning attempts to solve the issue by utilizing binary feedbacks on queried trajectory pairs.
We present a state augmentation technique that allows the agent's reward model to be robust.
arXiv Detail & Related papers (2023-02-17T07:10:50Z) - Towards Skilled Population Curriculum for Multi-Agent Reinforcement
Learning [42.540853953923495]
We introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination.
Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents.
We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound.
arXiv Detail & Related papers (2023-02-07T12:30:52Z) - TransfQMix: Transformers for Leveraging the Graph Structure of
Multi-Agent Reinforcement Learning Problems [0.0]
We present TransfQMix, a new approach that uses transformers to leverage a latent graph structure and learn better coordination policies.
Our transformer Q-mixer learns a monotonic mixing-function from a larger graph that includes the internal and external states of the agents.
We report TransfQMix's performances in the Spread and StarCraft II environments.
arXiv Detail & Related papers (2023-01-13T00:07:08Z) - RPM: Generalizable Behaviors for Multi-Agent Reinforcement Learning [90.43925357575543]
We propose ranked policy memory ( RPM) to collect diverse multi-agent trajectories for training MARL policies with good generalizability.
RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks, and it significantly boosts the performance up to 402% on average.
arXiv Detail & Related papers (2022-10-18T07:32:43Z) - Decentralized scheduling through an adaptive, trading-based multi-agent
system [1.7403133838762448]
In multi-agent reinforcement learning systems, the actions of one agent can have a negative impact on the rewards of other agents.
This work applies a trading approach to a simulated scheduling environment, where the agents are responsible for the assignment of incoming jobs to compute cores.
The agents can trade the usage right of computational cores to process high-priority, high-reward jobs faster than low-priority, low-reward jobs.
arXiv Detail & Related papers (2022-07-05T13:50:18Z) - Modeling Bounded Rationality in Multi-Agent Simulations Using Rationally
Inattentive Reinforcement Learning [85.86440477005523]
We study more human-like RL agents which incorporate an established model of human-irrationality, the Rational Inattention (RI) model.
RIRL models the cost of cognitive information processing using mutual information.
We show that using RIRL yields a rich spectrum of new equilibrium behaviors that differ from those found under rational assumptions.
arXiv Detail & Related papers (2022-01-18T20:54:00Z) - The Effects of Reward Misspecification: Mapping and Mitigating
Misaligned Models [85.68751244243823]
Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z) - Multi-Agent Transfer Learning in Reinforcement Learning-Based
Ride-Sharing Systems [3.7311680121118345]
Reinforcement learning (RL) has been used in a range of simulated real-world tasks.
In this paper we investigate the impact of TL transfer parameters with fixed source and target roles.
arXiv Detail & Related papers (2021-12-01T11:23:40Z) - Why Do Self-Supervised Models Transfer? Investigating the Impact of
Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images.
We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z) - Low-Precision Reinforcement Learning [63.930246183244705]
Low-precision training has become a popular approach to reduce computation time, memory footprint, and energy consumption in supervised learning.
In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a na"ive adaptation of low-precision methods from supervised learning fails.
arXiv Detail & Related papers (2021-02-26T16:16:28Z) - PsiPhi-Learning: Reinforcement Learning with Demonstrations using
Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called emphinverse temporal difference learning (ITD)
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $Psi Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z) - Enhancing reinforcement learning by a finite reward response filter with
a case study in intelligent structural control [0.0]
In many reinforcement learning (RL) problems, it takes some time until a taken action by the agent reaches its maximum effect on the environment.
This paper introduces an applicable enhanced Q-learning method in which at the beginning of the learning phase, the agent takes a single action.
We have applied the developed method to a structural control problem in which the goal of the agent is to reduce the vibrations of a building subjected to earthquake excitations with a specified delay.
arXiv Detail & Related papers (2020-10-25T19:28:35Z) - Hierarchically Decoupled Imitation for Morphological Transfer [95.19299356298876]
We show that transferring learned information from a morphologically simpler agent can massively improve the sample efficiency of a more complex one.
First, we show that incentivizing a complex agent's low-level to imitate a simpler agent's low-level significantly improves zero-shot high-level transfer.
Second, we show that KL-regularized training of the high level stabilizes learning and prevents mode-collapse.
arXiv Detail & Related papers (2020-03-03T18:56:49Z) - Meta Reinforcement Learning with Autonomous Inference of Subtask
Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.