The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2506.13672v1
- Date: Mon, 16 Jun 2025 16:30:00 GMT
- Title: The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning
- Authors: Jiashun Liu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Ling Pan
- Abstract summary: Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy. We propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination.
- Score: 19.01686700722506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help improve sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of "polluting" the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
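The abstract describes early termination driven by Q-value and gradient statistics but does not spell out the rule here. The following is only a minimal sketch of what such a rule could look like, and it should not be read as the LEAST algorithm itself. The class name `EarlyStopper`, the rolling-median thresholds, the `window` and `min_history` parameters, and the `run_episode` helper are all assumptions made for illustration.

```python
from collections import deque
from statistics import median


class EarlyStopper:
    """Hypothetical stopping rule; NOT the LEAST algorithm from the paper."""

    def __init__(self, window=100, min_history=10):
        # Rolling windows of recently observed statistics (assumed design).
        self.q_history = deque(maxlen=window)
        self.grad_history = deque(maxlen=window)
        self.min_history = min_history

    def record(self, q_value, grad_norm):
        """Store the statistics of a step that was actually kept."""
        self.q_history.append(q_value)
        self.grad_history.append(grad_norm)

    def should_stop(self, q_value, grad_norm):
        """Stop when both value and learning signal fall below recent medians."""
        if len(self.q_history) < self.min_history:
            return False  # not enough history to judge yet
        low_value = q_value < median(self.q_history)
        low_signal = grad_norm < median(self.grad_history)
        return low_value and low_signal


def run_episode(steps, stopper):
    """Consume (q_value, grad_norm) pairs standing in for environment steps.

    Returns the number of steps taken before the episode ended.
    """
    taken = 0
    for q_value, grad_norm in steps:
        taken += 1
        if stopper.should_stop(q_value, grad_norm):
            break  # cut the episode short instead of running to termination
        stopper.record(q_value, grad_norm)
    return taken
```

In a real agent the `(q_value, grad_norm)` pairs would come from the critic and its update step; here they are passed in directly so the sketch stays self-contained. Requiring both signals to be low is one possible design choice: it avoids stopping on low-value states that still produce large gradients and may therefore be informative.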
Related papers
- Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound [4.350004414611934]
We introduce a novel theoretical result that leverages the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded.
arXiv Detail & Related papers (2025-07-15T12:46:25Z)
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
- Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training [71.16258800411696]
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. Existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA).
arXiv Detail & Related papers (2025-03-24T17:51:39Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity [50.35634234137108]
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy.
We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode.
We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation.
arXiv Detail & Related papers (2024-05-23T02:41:36Z)
- OER: Offline Experience Replay for Continual Offline Reinforcement Learning [25.985985377992034]
Continuously learning new skills via a sequence of pre-collected offline datasets is desired for an agent.
In this paper, we formulate a new setting, continual offline reinforcement learning (CORL), where an agent learns a sequence of offline reinforcement learning tasks.
We propose a new model-based experience selection scheme to build the replay buffer, where a transition model is learned to approximate the state distribution.
arXiv Detail & Related papers (2023-05-23T08:16:44Z)
- Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z)
- Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning [17.3794999533024]
We show that deep RL appears to struggle in the presence of extraneous data.
Recent works have shown that the performance of Deep Q-Network (DQN) degrades when its replay memory becomes too large.
We re-examine the motivation for sampling uniformly over a replay memory, and find that it may be flawed when using function approximation.
arXiv Detail & Related papers (2021-02-22T19:29:18Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.