Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network
- URL: http://arxiv.org/abs/2502.00288v1
- Date: Sat, 01 Feb 2025 03:04:53 GMT
- Title: Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network
- Authors: Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu, Yu Wang
- Abstract summary: We propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner.
ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks.
It auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks.
- Score: 23.481553466650453
- License:
- Abstract: Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and online-collected data gathered during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
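As a rough illustration of the two mechanisms the abstract describes (coarse-to-fine discretization of each action dimension and auto-regressive prediction of per-dimension advantages), the following is a minimal, hedged sketch of the action-selection step. The class name, network shape, one-hot context encoding, and Boltzmann sampling are assumptions made for illustration, not the authors' implementation.
```python
# Hedged sketch of the action-selection step suggested by the abstract:
# coarse-to-fine discretization of each action dimension plus auto-regressive
# conditioning on earlier choices. Names and design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARSQPolicySketch(nn.Module):
    def __init__(self, state_dim, action_dim, bins=3, levels=2, hidden=256, temperature=0.1):
        super().__init__()
        self.action_dim, self.bins, self.levels = action_dim, bins, levels
        self.temperature = temperature
        # Input: state plus one-hot codes of every previously chosen
        # (dimension, level) bin, so later predictions condition on earlier ones.
        ctx_dim = state_dim + action_dim * levels * bins
        self.adv_head = nn.Sequential(
            nn.Linear(ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, bins),  # advantages over the bins at the current level
        )

    @torch.no_grad()
    def act(self, state):
        """state: 1-D tensor of shape (state_dim,). Returns a continuous action."""
        ctx = torch.zeros(self.action_dim * self.levels * self.bins)
        low, high = -torch.ones(self.action_dim), torch.ones(self.action_dim)
        for d in range(self.action_dim):            # auto-regressive over dimensions
            for l in range(self.levels):            # coarse-to-fine over levels
                adv = self.adv_head(torch.cat([state, ctx]))
                probs = F.softmax(adv / self.temperature, dim=-1)  # soft (Boltzmann) choice
                k = torch.multinomial(probs, 1).item()
                width = (high[d] - low[d]) / self.bins
                low[d], high[d] = low[d] + k * width, low[d] + (k + 1) * width  # zoom in
                ctx[(d * self.levels + l) * self.bins + k] = 1.0   # record the choice
        return (low + high) / 2.0                   # center of the final fine bin
```
Training would pair this selection rule with a soft Q-learning objective over the same discrete bins; that part, and the exact advantage parameterization, are omitted here.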
Related papers
- D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning [99.33607114541861]
We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments.
Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation.
arXiv Detail & Related papers (2024-08-15T22:27:00Z)
- Continuous Control with Coarse-to-fine Reinforcement Learning [15.585706638252441]
We present a framework that trains RL agents to zoom into a continuous action space in a coarse-to-fine manner.
We introduce a concrete, value-based algorithm within the framework called Coarse-to-fine Q-Network (CQN).
CQN robustly learns to solve real-world manipulation tasks within a few minutes of online training.
arXiv Detail & Related papers (2024-07-10T16:04:08Z)
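For the CQN entry above, the core zooming step can be illustrated with a short, hedged sketch; the per-level Q-network interface and the greedy bin choice are assumptions for illustration, not the authors' code.
```python
# Hedged sketch of the "zoom into the action space" step behind CQN. The
# interface q_levels[l](state, centers) is an illustrative assumption.
import torch

def coarse_to_fine_act(q_levels, state, action_dim, bins=5):
    """q_levels: list of callables; q_levels[l](state, centers) returns
    Q-values of shape (action_dim, bins) for the candidate bin centers."""
    low, high = -torch.ones(action_dim), torch.ones(action_dim)
    for q_net in q_levels:
        # Candidate bin centers for every dimension at the current zoom level.
        centers = low[:, None] + (torch.arange(bins) + 0.5) * (high - low)[:, None] / bins
        best = q_net(state, centers).argmax(dim=-1)      # greedy bin per dimension
        width = (high - low) / bins
        low = low + best * width                         # shrink the interval ...
        high = low + width                               # ... to the chosen bin
    return (low + high) / 2.0                            # fine-grained continuous action
```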
- Equivariant Offline Reinforcement Learning [7.822389399560674]
We investigate the use of $SO(2)$-equivariant neural networks for offline RL with a limited number of demonstrations.
Our experimental results show that equivariant versions of Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) outperform their non-equivariant counterparts.
arXiv Detail & Related papers (2024-06-20T03:02:49Z)
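For the equivariant offline RL entry above, the sketch below illustrates what $SO(2)$ equivariance means at the level of a single layer when planar (2-D) features are viewed as complex numbers; the layer and its use inside CQL or IQL critics are illustrative assumptions, not the paper's architecture.
```python
# Illustrative sketch of an SO(2)-equivariant linear layer on planar features.
import torch
import torch.nn as nn

class SO2EquivariantLinear(nn.Module):
    """Complex-linear map on planar features: rotating every input vector by
    an angle theta rotates every output vector by the same theta."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_re = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.w_im = nn.Parameter(0.1 * torch.randn(out_features, in_features))

    def forward(self, x):
        # x: (..., in_features, 2), last axis holds the planar coordinates.
        z = torch.view_as_complex(x.contiguous())   # (..., in_features), complex
        w = torch.complex(self.w_re, self.w_im)     # (out_features, in_features)
        out = z @ w.t()                             # complex-linear, hence equivariant
        return torch.view_as_real(out)              # (..., out_features, 2)
```
Rotating every input feature by an angle theta multiplies z elementwise by e^{i*theta}; because the map is complex-linear, the output picks up the same factor, i.e. every output feature rotates by theta.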
- CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning [62.58375643251612]
We propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection.
With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can guide itself, driven by curiosity, to collect higher-quality data.
Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.
arXiv Detail & Related papers (2023-12-19T14:26:23Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
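For the action-quantization entry above, here is a deliberately simplified stand-in: the paper's scheme is adaptive and learned, whereas this sketch runs plain k-means over the dataset's continuous actions purely to show the quantize-then-train-discrete-offline-RL pipeline; all names are hypothetical.
```python
# Simplified stand-in for action quantization via plain k-means (numpy only).
import numpy as np

def kmeans_quantize_actions(actions, n_codes=64, iters=50, seed=0):
    """actions: (N, action_dim) array of continuous dataset actions.
    Returns (codebook, codes): (n_codes, action_dim) centroids and (N,) code indices."""
    rng = np.random.default_rng(seed)
    codebook = actions[rng.choice(len(actions), size=n_codes, replace=False)].copy()
    for _ in range(iters):
        # Assign every dataset action to its nearest centroid.
        dists = np.linalg.norm(actions[:, None, :] - codebook[None, :, :], axis=-1)
        codes = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        for k in range(n_codes):
            members = actions[codes == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook, codes
```
A discrete-action variant of IQL, CQL, or BRAC can then be trained on the (state, code) pairs; at execution time the agent outputs a code and acts with codebook[code].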
- Digital Twin Assisted Deep Reinforcement Learning for Online Admission Control in Sliced Network [19.152875040151976]
We propose a digital twin (DT)-accelerated DRL solution for online admission control in sliced networks.
A neural network-based DT is established with a customized output layer for queuing systems, trained through supervised learning, and then employed to assist the training phase of the DRL model.
Extensive simulations show that the DT-accelerated DRL improves resource utilization by over 40% compared to the directly trained state-of-the-art dueling deep Q-learning model.
arXiv Detail & Related papers (2023-10-07T09:09:19Z)
- Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes [100.69714600180895]
Offline Q-learning algorithms exhibit strong performance that scales with model capacity.
We train a single policy on 40 games with near-human performance using networks of up to 80 million parameters.
Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal.
arXiv Detail & Related papers (2022-11-28T08:56:42Z)
- Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL [0.0]
Decision Transformer (DT) combines the conditional policy approach and a transformer architecture.
However, DT lacks the stitching ability that is critical for offline RL to learn the optimal policy.
We propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT.
arXiv Detail & Related papers (2022-09-08T18:26:39Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- AWAC: Accelerating Online Reinforcement Learning with Offline Datasets [84.94748183816547]
We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience.
Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
arXiv Detail & Related papers (2020-06-16T17:54:41Z)
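For the AWAC entry above, the actor update that lets the method exploit both prior demonstrations and online experience is an advantage-weighted log-likelihood objective. The sketch below assumes a PyTorch policy that returns a torch.distributions object and a Q-network q_net(s, a); the single-sample value estimate is a simplification of the paper's Monte Carlo estimate.
```python
# Hedged sketch of AWAC's advantage-weighted actor update; the TD critic
# update is standard and omitted. Names and interfaces are assumptions.
import torch

def awac_actor_loss(policy, q_net, states, actions, lam=1.0):
    """states: (B, state_dim), actions: (B, action_dim), both sampled from the
    combined buffer of prior demonstrations and online experience."""
    dist = policy(states)                          # e.g. a diagonal Gaussian
    with torch.no_grad():
        v = q_net(states, dist.sample())           # V(s) ~ Q(s, a'), a' ~ pi(.|s)
        adv = q_net(states, actions) - v           # A(s, a) = Q(s, a) - V(s)
        weights = torch.exp(adv / lam)             # advantage weighting
    log_prob = dist.log_prob(actions).sum(-1)      # (B,)
    return -(weights.squeeze(-1) * log_prob).mean()
```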