Related papers: Q-WSL: Optimizing Goal-Conditioned RL with Weighted Supervised Learning via Dynamic Programming

Q-WSL: Optimizing Goal-Conditioned RL with Weighted Supervised Learning via Dynamic Programming

URL: http://arxiv.org/abs/2410.06648v4
Date: Tue, 22 Oct 2024 03:35:54 GMT
Title: Q-WSL: Optimizing Goal-Conditioned RL with Weighted Supervised Learning via Dynamic Programming
Authors: Xing Lei, Xuetao Zhang, Zifeng Zhuang, Donglin Wang,
Abstract summary: A novel class of advanced algorithms, termed GoalConditioned Weighted Supervised Learning (GCWSL), has recently emerged to tackle the challenges posed by sparse rewards in goal-conditioned reinforcement learning (RL) GCWSL consistently delivers strong performance across a diverse set of goal-reaching tasks due to its simplicity, effectiveness, and stability. However, GCWSL methods lack a crucial capability known as trajectory stitching, which is essential for learning optimal policies when faced with unseen skills during testing. In this paper, we propose Q-learning Weighted Supervised Learning (Q-WSL), a novel framework designed to overcome the limitations of GC
Score: 22.359171999254706
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A novel class of advanced algorithms, termed Goal-Conditioned Weighted Supervised Learning (GCWSL), has recently emerged to tackle the challenges posed by sparse rewards in goal-conditioned reinforcement learning (RL). GCWSL consistently delivers strong performance across a diverse set of goal-reaching tasks due to its simplicity, effectiveness, and stability. However, GCWSL methods lack a crucial capability known as trajectory stitching, which is essential for learning optimal policies when faced with unseen skills during testing. This limitation becomes particularly pronounced when the replay buffer is predominantly filled with sub-optimal trajectories. In contrast, traditional TD-based RL methods, such as Q-learning, which utilize Dynamic Programming, do not face this issue but often experience instability due to the inherent difficulties in value function approximation. In this paper, we propose Q-learning Weighted Supervised Learning (Q-WSL), a novel framework designed to overcome the limitations of GCWSL by incorporating the strengths of Dynamic Programming found in Q-learning. Q-WSL leverages Dynamic Programming results to output the optimal action of (state, goal) pairs across different trajectories within the replay buffer. This approach synergizes the strengths of both Q-learning and GCWSL, effectively mitigating their respective weaknesses and enhancing overall performance. Empirical evaluations on challenging goal-reaching tasks demonstrate that Q-WSL surpasses other goal-conditioned approaches in terms of both performance and sample efficiency. Additionally, Q-WSL exhibits notable robustness in environments characterized by binary reward structures and environmental stochasticity.

Related papers

GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning [15.43938821214447]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs)<n>This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework.<n>GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance.
arXiv Detail & Related papers (2025-07-14T08:10:00Z)
Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic [12.837649598521102]
This paper introduces the Q-guided STein variational model predictive Actor-Critic (Q-STAC) framework for continuous control tasks.<n>Our method optimize control sequences directly using learned Q-values as objectives, eliminating the need for explicit cost function design.<n>Experiments on 2D navigation and robotic manipulation tasks demonstrate that Q-STAC achieves superior sample efficiency, robustness, and optimality compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2025-07-09T07:53:53Z)
Scaling CrossQ with Weight Normalization [15.605124749589946]
CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1.<n>We identify challenges in the training dynamics which are emphasized by higher UTDs.<n>We propose a solution that stabilizes training, prevents potential loss of plasticity and keeps the effective learning rate constant.
arXiv Detail & Related papers (2025-06-04T09:24:17Z)
Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks. PbRL presents a pioneering framework that capitalizes on human preferences as pivotal reward signals. We propose a LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z)
Enhancing Q-Learning with Large Language Model Heuristics [0.0]
Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. We propose textbfLLM-guided Q-learning, a framework that leverages LLMs as hallucinations to aid in learning the Q-function for reinforcement learning.
arXiv Detail & Related papers (2024-05-06T10:42:28Z)
Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF [82.73541793388]
We introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based gradient (policy) algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.
arXiv Detail & Related papers (2024-02-10T04:54:15Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Multi-Objective Reinforcement Learning-based Approach for Pressurized Water Reactor Optimization [0.0]
PEARL distinguishes itself from traditional policy-based multi-objective Reinforcement Learning methods by learning a single policy. Several versions inspired from deep learning and evolutionary techniques have been crafted, catering to both unconstrained and constrained problem domains. It is tested on two practical PWR core Loading Pattern optimization problems to showcase its real-world applicability.
arXiv Detail & Related papers (2023-12-15T20:41:09Z)
Goal-Conditioned Q-Learning as Knowledge Distillation [136.79415677706612]
We explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals.
arXiv Detail & Related papers (2022-08-28T22:01:10Z)
Rethinking Goal-conditioned Supervised Learning and Its Connection to Offline RL [49.26825108780872]
Goal-Conditioned Supervised Learning (GCSL) provides a new learning framework by iteratively relabeling and imitating self-generated experiences. We extend GCSL as a novel offline goal-conditioned RL algorithm. We show that WGCSL can consistently outperform GCSL and existing state-of-the-art offline methods.
arXiv Detail & Related papers (2022-02-09T14:17:05Z)
SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms. SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
Robust Reinforcement Learning via Adversarial training with Langevin Dynamics [51.234482917047835]
We introduce a sampling perspective to tackle the challenging task of training robust Reinforcement Learning (RL) agents. We present a novel, scalable two-player RL algorithm, which is a sampling variant of the two-player policy method.
arXiv Detail & Related papers (2020-02-14T14:59:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.