Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
- URL: http://arxiv.org/abs/2507.11269v1
- Date: Tue, 15 Jul 2025 12:46:25 GMT
- Title: Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
- Authors: Tal Fiskus, Uri Shaham
- Abstract summary: We introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded.
- Score: 4.350004414611934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep reinforcement learning (DRL) agents excel at solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, agents such as DQN and SAC equipped with our proposed term achieve up to a 2,427% higher reward ratio than the same agents without it, while reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
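The data-recycling mechanism is simple to prototype: each transition stored in the replay buffer also carries the value network's output at collection time, and that stored output feeds an extra term added to the training loss. The paper's actual causal bound is not spelled out in this summary, so the sketch below uses a generic penalty between the current and stored Q-values as a stand-in; the buffer layout, the `causal_coef` weight, and the penalty form are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ReplayBufferWithStoredQ:
    """Replay buffer that also keeps the value network's output recorded when
    each transition was collected (data that is usually discarded)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done, stored_q):
        # stored_q: Q(state, action) as evaluated by the network at collection time
        self.buffer.append((state, action, reward, next_state, done, stored_q))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones, stored_qs = zip(*batch)
        return (torch.stack(states), torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states), torch.tensor(dones, dtype=torch.float32),
                torch.tensor(stored_qs, dtype=torch.float32))


def dqn_loss_with_stored_q_term(q_net, target_net, batch, gamma=0.99, causal_coef=0.1):
    """Standard DQN TD loss plus an auxiliary term built from the stored
    collection-time Q-values (an illustrative stand-in for the paper's causal bound)."""
    states, actions, rewards, next_states, dones, stored_qs = batch
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values
    td_loss = nn.functional.smooth_l1_loss(q_sa, target)
    # Hypothetical extra term: keep current estimates close to the stored outputs.
    causal_term = nn.functional.mse_loss(q_sa, stored_qs)
    return td_loss + causal_coef * causal_term
```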
Related papers
- The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning [19.01686700722506]
Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning.
We argue that sampling uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy.
We propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination.
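The summary names the mechanism (strategic early episode termination) but not its rule, so the following is a minimal sketch of one plausible value-based stopping check; the `q_threshold` and `min_steps` parameters are hypothetical, not taken from LEAST.

```python
import torch

def should_stop_episode(q_net, state, steps_in_episode,
                        q_threshold=-1.0, min_steps=50):
    """Hypothetical early-termination check in the spirit of LEAST:
    cut the episode when the best available action value is low and the
    episode has already run long enough to be unlikely to recover."""
    if steps_in_episode < min_steps:
        return False  # always give the episode a minimum budget
    with torch.no_grad():
        best_q = q_net(state.unsqueeze(0)).max().item()
    # Stop (and avoid filling the buffer with wasteful transitions)
    # when the estimated return of continuing falls below the threshold.
    return best_q < q_threshold
```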
arXiv Detail & Related papers (2025-06-16T16:30:00Z)
- Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training.
Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers.
We propose Trajectory Balance with Asynchrony (TBA), which efficiently obtains the benefits of replay buffers for post-training.
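As the title suggests, TBA decouples exploration from learning and trains on replayed trajectories with a trajectory-balance objective. Below is a minimal sketch of a generic trajectory-balance loss, simplified for autoregressive generation (trivial backward policy); the variable names and the omission of all asynchrony machinery are simplifications, not the paper's implementation.

```python
import torch

def trajectory_balance_loss(log_z, log_pi_traj, log_reward):
    """Generic trajectory-balance objective: push log Z + log pi(trajectory)
    toward log R(trajectory). It is computed on trajectories drawn from a
    replay buffer, which is what makes the objective usable off-policy."""
    # log_z: learned scalar estimate of the log partition function
    # log_pi_traj: sum of per-token log-probabilities under the current policy
    # log_reward: log of the (positive, possibly temperature-scaled) reward
    return (log_z + log_pi_traj - log_reward).pow(2).mean()
```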
arXiv Detail & Related papers (2025-03-24T17:51:39Z)
- Reevaluating Policy Gradient Methods for Imperfect-Information Games [94.45878689061335]
We conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games.
We find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods.
arXiv Detail & Related papers (2025-02-13T03:38:41Z)
- Towards Sample-Efficiency and Generalization of Transfer and Inverse Reinforcement Learning: A Comprehensive Literature Review [50.67937325077047]
This paper provides a comprehensive review of how sample efficiency and generalization of RL algorithms can be achieved through transfer and inverse reinforcement learning (T-IRL).
Our findings indicate that most recent research works have addressed these challenges using human-in-the-loop and sim-to-real strategies.
Under the IRL structure, training schemes that require few experience transitions, and the extension of such frameworks to multi-agent and multi-intention problems, have been the priorities of researchers in recent years.
arXiv Detail & Related papers (2024-11-15T15:18:57Z)
- Stop Regressing: Training Value Functions via Classification for Scalable Deep RL [109.44370201929246]
We show that training value functions with categorical cross-entropy improves performance and scalability in a variety of domains.
These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers.
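Concretely, "training value functions via classification" means replacing the usual MSE regression loss with a cross-entropy loss over a discretized distribution of returns. Below is a minimal two-hot sketch of that idea (the paper also studies an HL-Gauss variant); the bin range and count are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

def two_hot_targets(returns, v_min=-10.0, v_max=10.0, num_bins=51):
    """Encode scalar return targets as 'two-hot' categorical distributions
    over fixed value bins, so the value head can be trained as a classifier."""
    bins = torch.linspace(v_min, v_max, num_bins)
    returns = returns.clamp(v_min, v_max)
    idx = torch.searchsorted(bins, returns).clamp(1, num_bins - 1)
    lo, hi = bins[idx - 1], bins[idx]
    w_hi = (returns - lo) / (hi - lo)  # interpolation weight toward the upper bin
    target = torch.zeros(returns.shape[0], num_bins)
    target.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
    target.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
    return target

def classification_value_loss(value_logits, returns):
    """Cross-entropy between predicted bin logits and two-hot targets,
    replacing the usual MSE regression loss on scalar values."""
    targets = two_hot_targets(returns)
    return F.cross_entropy(value_logits, targets)
```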
arXiv Detail & Related papers (2024-03-06T18:55:47Z)
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, designing immediate reward signals is difficult.
We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
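The summary does not describe the redistribution rule itself; the sketch below is a much simpler stand-in for the paper's bidirectional attention mechanism, splitting each bag-level reward across its transitions with softmax weights from a small scoring network (the class, the scorer, and the weighting scheme are illustrative assumptions).

```python
import torch
import torch.nn as nn

class RewardRedistributor(nn.Module):
    """Illustrative bag-reward redistribution: score each transition in a bag
    and split the bag reward across transitions by softmax weights."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, bag_states, bag_reward):
        # bag_states: [T, state_dim] transitions in one bag; bag_reward: scalar
        weights = torch.softmax(self.scorer(bag_states).squeeze(-1), dim=0)
        # Per-transition proxy rewards, summing back to the bag reward.
        return weights * bag_reward
```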
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Episodic Reinforcement Learning with Expanded State-reward Space [1.479675621064679]
We introduce an efficient episodic-control-based (EC-based) DRL framework with an expanded state-reward space, in which both the expanded states used as input and the expanded rewards used in training contain historical as well as current information.
Our method simultaneously achieves full utilization of retrieved information and better evaluation of state values via a Temporal Difference (TD) loss.
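A hedged reading of "expanded state-reward space": the input state is augmented with retrieved episodic information, and the training reward blends the immediate reward with a retrieved return, both then fed to an ordinary TD loss. The sketch below illustrates only this data-shaping step; the concatenation and the `mix` coefficient are assumptions rather than the paper's construction.

```python
import torch

def expand_state_and_reward(state, reward, retrieved_features, retrieved_return, mix=0.5):
    """Illustrative 'expansion': append retrieved episodic information to the
    current state, and blend the immediate reward with a retrieved return."""
    expanded_state = torch.cat([state, retrieved_features], dim=-1)
    expanded_reward = (1 - mix) * reward + mix * retrieved_return
    return expanded_state, expanded_reward
```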
arXiv Detail & Related papers (2024-01-19T06:14:36Z)
- Replay across Experiments: A Natural Extension of Off-Policy RL [18.545939667810565]
We present an effective yet simple framework to extend the use of replays across multiple experiments.
At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning.
We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation.
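Because RaE simply reuses experience stored by earlier experiments, a minimal version fits in a few lines: load the saved transitions, then draw each batch as a mix of old and new data. The file format and fixed `mix_ratio` below are assumptions, not the authors' recipe.

```python
import pickle
import random

def build_mixed_sampler(new_buffer, prior_experiment_paths, mix_ratio=0.5):
    """Illustrative Replay-across-Experiments setup: seed training with
    transitions saved from earlier experiments, then sample a fixed ratio
    of old vs. newly collected data for each batch."""
    old_transitions = []
    for path in prior_experiment_paths:
        with open(path, "rb") as f:
            old_transitions.extend(pickle.load(f))  # assumed: list of transition tuples

    def sample(batch_size):
        n_old = int(batch_size * mix_ratio)
        batch = random.sample(old_transitions, min(n_old, len(old_transitions)))
        batch += random.sample(new_buffer, min(batch_size - len(batch), len(new_buffer)))
        return batch

    return sample
```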
arXiv Detail & Related papers (2023-11-27T15:57:11Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
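The summary states ImRE's key idea: sample rare events in proportion to their impact on the value function and correct for the induced bias. The sketch below shows that sampling step with an importance-weight correction, using |TD error| as a stand-in impact measure; the exact impact estimate used in the paper may differ.

```python
import numpy as np

def sample_with_importance(transitions, impacts, batch_size, rng=np.random):
    """Illustrative ImRE-style sampling: draw transitions with probability
    proportional to their estimated impact on the value function, and return
    importance weights that correct the bias of non-uniform sampling in the
    subsequent Q-learning update."""
    probs = np.asarray(impacts, dtype=np.float64)
    probs = probs / probs.sum()
    idx = rng.choice(len(transitions), size=batch_size, p=probs)
    weights = 1.0 / (len(transitions) * probs[idx])  # importance weights
    weights = weights / weights.max()                # normalize for stability
    return [transitions[i] for i in idx], weights
```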
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
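In the paper the retrieval process is itself trained; the sketch below shows only a simpler non-learned skeleton, a nearest-neighbor lookup over stored experience embeddings whose result the agent can condition on. The key/value layout and mean aggregation are illustrative assumptions.

```python
import torch

def retrieve_context(query_embedding, memory_keys, memory_values, k=5):
    """Illustrative retrieval step: find the k most similar stored experiences
    and return their associated information for the agent to condition on."""
    sims = memory_keys @ query_embedding          # [N] similarities to the query
    topk = torch.topk(sims, k).indices
    return memory_values[topk].mean(dim=0)        # aggregated retrieved context
```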
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning [17.3794999533024]
We show that deep RL appears to struggle in the presence of extraneous data.
Recent works have shown that the performance of Deep Q-Network (DQN) degrades when its replay memory becomes too large.
We re-examine the motivation for sampling uniformly over a replay memory, and find that it may be flawed when using function approximation.
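One way to read the correction of multiplicity bias is as stratified sampling: group stored transitions by (for example) a discretized state, pick strata uniformly, then pick within each chosen stratum, so frequently revisited states no longer dominate a batch. The sketch below implements that reading; the grouping key is an assumption, not necessarily the paper's exact stratification.

```python
import random
from collections import defaultdict

def stratified_sample(transitions, key_fn, batch_size, rng=random):
    """Illustrative stratified replay sampling: choose a stratum uniformly at
    random, then a transition within it, for each element of the batch."""
    strata = defaultdict(list)
    for t in transitions:
        strata[key_fn(t)].append(t)   # e.g., key_fn discretizes the state
    keys = list(strata)
    batch = []
    for _ in range(batch_size):
        stratum = strata[rng.choice(keys)]
        batch.append(rng.choice(stratum))
    return batch
```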
arXiv Detail & Related papers (2021-02-22T19:29:18Z)
- Deep Reinforcement Learning with Quantum-inspired Experience Replay [6.833294755109369]
A novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay.
The proposed deep reinforcement learning with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity of each experience (also called a transition) and the number of times it has been replayed.
The experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency.
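The summary names the two signals DRL-QER combines, the complexity of a transition and how often it has been replayed, but not the quantum-inspired update itself. The sketch below is a classical stand-in: priority grows with |TD error| and decays with replay count; the functional form is an assumption.

```python
import numpy as np

def qer_style_priority(td_errors, replay_counts, alpha=0.6, decay=0.5):
    """Illustrative priority in the spirit of DRL-QER: 'complex' transitions
    (large TD error) are preferred, while transitions that have already been
    replayed many times are gradually de-emphasized."""
    complexity = np.abs(np.asarray(td_errors, dtype=np.float64)) ** alpha
    freshness = decay ** np.asarray(replay_counts, dtype=np.float64)
    priority = complexity * freshness
    return priority / priority.sum()  # sampling probabilities over the buffer
```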
arXiv Detail & Related papers (2021-01-06T13:52:04Z)