Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data
- URL: http://arxiv.org/abs/2509.10303v1
- Date: Fri, 12 Sep 2025 14:45:39 GMT
- Title: Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data
- Authors: Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang
- Abstract summary: Conservative Discrete Quantile Actor-Critic (CDQAC) learns effective scheduling policies directly from historical data. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines.
- Score: 2.0718953516814103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Job-Shop Scheduling Problem (JSP) and the Flexible Job-Shop Scheduling Problem (FJSP) are canonical combinatorial optimization problems with wide-ranging applications in industrial operations. In recent years, many online reinforcement learning (RL) approaches have been proposed to learn constructive heuristics for JSP and FJSP. Although effective, these online RL methods require millions of interactions with simulated environments that may not capture real-world complexities, and their random policy initialization leads to poor sample efficiency. To address these limitations, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL algorithm that learns effective scheduling policies directly from historical data, eliminating the need for costly online interactions, while maintaining the ability to improve upon suboptimal training data. CDQAC couples a quantile-based critic with a delayed policy update, estimating the return distribution of each machine-operation pair rather than selecting pairs outright. Our extensive experiments demonstrate CDQAC's remarkable ability to learn from diverse data sources. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines. In addition, CDQAC is highly sample efficient, requiring only 10-20 training instances to learn high-quality policies. Surprisingly, we find that CDQAC performs better when trained on data generated by a random heuristic than when trained on higher-quality data from genetic algorithms and priority dispatching rules.
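A minimal sketch of the ingredients the abstract names, assuming a PyTorch-style discrete-action setup: a quantile critic that estimates the return distribution of every (machine, operation) pair, an assumed CQL-style conservative term, and a delayed policy update. All names, the penalty form, and the hyperparameters are illustrative, not the authors' implementation.

```python
import torch

N_QUANTILES = 32
TAUS = (torch.arange(N_QUANTILES).float() + 0.5) / N_QUANTILES  # quantile midpoints

def quantile_huber_loss(pred, target, kappa=1.0):
    # pred, target: (batch, N_QUANTILES); the target is treated as fixed
    td = target.detach().unsqueeze(1) - pred.unsqueeze(2)        # (B, Nq, Nq)
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (TAUS.view(1, -1, 1) - (td < 0).float()).abs()      # quantile weights
    return (weight * huber / kappa).sum(1).mean()

def critic_loss(z_all, action, target_z, alpha=1.0):
    # z_all: (B, n_pairs, N_QUANTILES) quantiles for every machine-operation
    # pair; action: (B,) long indices of the pairs taken in the dataset
    idx = action.view(-1, 1, 1).expand(-1, 1, N_QUANTILES)
    z_sa = z_all.gather(1, idx).squeeze(1)                       # taken pair only
    td_loss = quantile_huber_loss(z_sa, target_z)
    # assumed CQL-style conservatism on the mean of the quantiles: push down
    # unseen pairs, push up the pair observed in the data
    q_mean = z_all.mean(-1)
    q_data = q_mean.gather(1, action.view(-1, 1)).squeeze(1)
    conservative = (torch.logsumexp(q_mean, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
# The actor is refreshed only every d critic steps (the delayed policy update).
```

Acting greedily on the mean of the quantiles recovers an ordinary Q-value, while the full return distribution gives the critic a richer training signal than a scalar estimate.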
Related papers
- Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning [24.46783760408068]
Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction.
Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training.
We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training.
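A hedged sketch of the idea as summarized above: a policy loss mixing an RL term with a BC term, with the constraint weight adapted during training. The simple first-order update on `lam` below is a stand-in for the paper's second-order differentiable scheme; `eps` and `lr_lam` are assumed knobs.

```python
import torch

def aspc_policy_loss(q_pi, logp_data, lam):
    # q_pi: Q(s, a~pi) for policy actions; logp_data: log pi(a_data | s)
    rl_term = -q_pi.mean()          # push the policy toward high-value actions
    bc_term = -logp_data.mean()     # keep the policy close to the dataset
    return rl_term + lam * bc_term

# Adapt the constraint weight: tighten lam when the BC error exceeds a
# target level eps, relax it otherwise.
lam, eps, lr_lam = 1.0, 1.0, 1e-3
q_pi = torch.randn(256)             # placeholder batch for illustration
logp_data = -3.0 * torch.rand(256)
bc_err = float(-logp_data.mean())
lam = max(0.0, lam + lr_lam * (bc_err - eps))
loss = aspc_policy_loss(q_pi, logp_data, lam)
```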
arXiv Detail & Related papers (2025-08-27T14:00:18Z)
- Pretraining a Shared Q-Network for Data-Efficient Offline Reinforcement Learning [9.981340960529185]
Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment.
We propose a plug-and-play pretraining method to initialize the features of a Q-network to enhance data efficiency in offline RL.
We show that our method significantly boosts data-efficient offline RL across various data qualities and data distributions through the D4RL and ExoRL benchmarks.
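A hypothetical sketch of the plug-and-play idea: learn a shared feature encoder from the offline transitions, then reuse it to initialize the Q-network before running any offline RL algorithm. The forward-dynamics pretraining objective here is an assumption about what the shared representation is trained on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, feat_dim = 17, 6, 256
encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
dynamics_head = nn.Linear(feat_dim + n_actions, obs_dim)

def pretrain_loss(s, a_onehot, s_next):
    # the shared representation is trained to predict the next state
    z = encoder(s)
    pred = dynamics_head(torch.cat([z, a_onehot], dim=-1))
    return F.mse_loss(pred, s_next)

# After pretraining, the encoder initializes the Q-network's feature layers,
# and the whole network is trained with the offline RL method of choice.
q_network = nn.Sequential(encoder, nn.Linear(feat_dim, n_actions))
```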
arXiv Detail & Related papers (2025-05-09T00:26:01Z)
- Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems.
Current value-based RL methods have proven particularly challenging to scale to the setting of large language models.
We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
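A rough sketch of the stated idea of casting Q-learning as a modified supervised fine-tuning problem: reuse the language model's token log-probabilities and weight the supervised loss by bootstrapped value estimates, so no separate Q-head is needed. The exact weighting scheme is an assumption.

```python
import torch
import torch.nn.functional as F

def q_sft_loss(logits, tokens, boot_values):
    # logits: (B, vocab); tokens: (B,) dataset actions;
    # boot_values: (B,) bootstrapped targets normalized to [0, 1]
    logp = F.log_softmax(logits, dim=-1)
    logp_a = logp.gather(1, tokens.unsqueeze(1)).squeeze(1)
    # weight the supervised term by the bootstrapped target, so a token's
    # learned probability tracks its (normalized) value estimate
    return -(boot_values.detach() * logp_a).mean()
```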
arXiv Detail & Related papers (2024-11-07T21:36:52Z)
- Offline Reinforcement Learning for Learning to Dispatch for Job Shop Scheduling [0.9831489366502301]
The Job Shop Scheduling Problem (JSSP) is a complex optimization problem.
Online Reinforcement Learning (RL) has shown promise by quickly finding acceptable solutions for JSSP.
We introduce Offline Learned Dispatching (Offline-LD), an offline reinforcement learning approach for JSSP.
arXiv Detail & Related papers (2024-09-16T15:18:10Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
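One design choice commonly cited in this line of work is symmetric sampling, where each gradient step draws half its batch from the offline dataset and half from the online replay buffer; the summary above does not say which factor dominates, so treat this sketch as an assumption.

```python
import random

def symmetric_sample(offline_data, online_buffer, batch_size=256):
    # half the batch from the static offline dataset, half from fresh online
    # experience; counts shrink gracefully while the online buffer fills up
    half = batch_size // 2
    batch = random.sample(offline_data, k=min(half, len(offline_data)))
    batch += random.sample(online_buffer,
                           k=min(batch_size - len(batch), len(online_buffer)))
    return batch

# usage with placeholder transitions
offline_data = [("s", "a", 0.0, "s2")] * 1000
online_buffer = [("s", "a", 1.0, "s2")] * 100
batch = symmetric_sample(offline_data, online_buffer)
```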
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training a RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
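A hypothetical sketch of the importance-sampling scheme as summarized: draw rare-event transitions in proportion to an impact score and correct the induced bias with importance weights so the Q-update remains unbiased in expectation. The priority definition is an assumption.

```python
import numpy as np

def sample_rare_events(priorities, batch_size, rng=None):
    # priorities: per-transition impact scores, e.g. |TD error| scaled by
    # how rare the failure event is (the exact score is an assumption)
    rng = rng or np.random.default_rng()
    p = priorities / priorities.sum()
    idx = rng.choice(len(p), size=batch_size, p=p)
    # importance weights undo the skewed sampling so the Q-learning target
    # stays unbiased; normalize by the max for update stability
    w = 1.0 / (len(p) * p[idx])
    return idx, w / w.max()

idx, w = sample_rare_events(np.array([0.1, 0.1, 5.0, 0.2]), batch_size=2)
```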
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
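A hedged sketch of the soft Q-learning view of text generation: treat the model's logits over the vocabulary as Q-values, so the generation policy is a softmax over Q, and regress the emitted token's Q-value toward a soft Bellman target built from the task reward. The one-step target below is a simplification.

```python
import torch
import torch.nn.functional as F

def soft_value(next_logits, tau=1.0):
    # soft state value: V(s') = tau * logsumexp(Q(s', .) / tau)
    return tau * torch.logsumexp(next_logits / tau, dim=-1)

def soft_q_loss(logits, tokens, reward, next_logits, gamma=1.0):
    # the logit of the emitted token plays the role of Q(s, a)
    q_taken = logits.gather(1, tokens.unsqueeze(1)).squeeze(1)
    target = reward + gamma * soft_value(next_logits)
    return F.mse_loss(q_taken, target.detach())
```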
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
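A short sketch of the critic-regularized regression objective: weighted behavior cloning in which the critic's advantage estimate decides how strongly each dataset action is imitated. The `binary` and `exp` weightings follow the two variants commonly associated with CRR; treat the details as a sketch rather than the reference implementation.

```python
import torch

def crr_policy_loss(logp, q_sa, v_s, mode="exp", beta=1.0, clip=20.0):
    # logp: log pi(a|s) for dataset actions; advantage A = Q(s, a) - V(s)
    adv = q_sa - v_s
    if mode == "binary":
        w = (adv >= 0).float()          # imitate only non-negative-advantage actions
    else:
        w = torch.clamp(torch.exp(adv / beta), max=clip)  # clipped for stability
    return -(w.detach() * logp).mean()
```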
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
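A minimal sketch of the conservative term for discrete actions, assuming the commonly cited CQL(H) form: alongside the usual TD loss, penalize large Q-values over all actions via a logsumexp while pushing up Q on the actions actually present in the dataset, which is what produces the lower bound mentioned above.

```python
import torch

def cql_penalty(q_all, actions, alpha=1.0):
    # q_all: (B, n_actions) Q-values; actions: (B,) long dataset actions
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    # logsumexp pushes down Q everywhere; subtracting q_data pushes it
    # back up on in-distribution actions
    return alpha * (torch.logsumexp(q_all, dim=1) - q_data).mean()
```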
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.