Surpassing legacy approaches to PWR core reload optimization with single-objective Reinforcement learning
- URL: http://arxiv.org/abs/2402.11040v2
- Date: Sun, 14 Jul 2024 14:45:52 GMT
- Title: Surpassing legacy approaches to PWR core reload optimization with single-objective Reinforcement learning
- Authors: Paul Seurin, Koroush Shirvan
- Abstract summary: We have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization.
In this paper, we demonstrate the advantage of our RL-based approach, specifically using Proximal Policy Optimization (PPO).
PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and local search method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing the fuel cycle cost through the optimization of nuclear reactor core loading patterns involves multiple objectives and constraints, leading to a candidate-solution space too vast to enumerate explicitly. To advance the state-of-the-art in core reload patterns, we have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization. Our previous research laid the groundwork for these approaches and demonstrated their ability to discover high-quality patterns within a reasonable time frame. Stochastic optimization (SO) approaches, on the other hand, are commonly used in the literature, but no rigorous analysis establishes which approach is better in which scenario. In this paper, we demonstrate the advantage of our RL-based approach, specifically using Proximal Policy Optimization (PPO), against the most commonly used SO-based methods: Genetic Algorithm (GA), Parallel Simulated Annealing (PSA) with mixing of states, and Tabu Search (TS), as well as an ensemble-based method, Prioritized Replay Evolutionary and Swarm Algorithm (PESA). We found that the loading-pattern (LP) scenarios derived in this paper are amenable to a global search for rapidly identifying promising research directions, but the search must then transition into a local one to exploit those directions efficiently and avoid getting stuck in local optima. PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and a local search method. We then compared all algorithms against PPO in long runs, which exacerbated the differences seen in the shorter cases. Overall, the work demonstrates the statistical superiority of PPO over the other considered algorithms.
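To make the PPO framing concrete, below is a minimal, self-contained sketch of the idea on a toy placement problem: a policy with learnable weights fills core positions one at a time and is updated with the clipped surrogate loss. Everything here (the `Policy` network, `toy_reward`, and the problem sizes) is a hypothetical stand-in, not the authors' setup, which would couple the agent to a core physics evaluator and its fuel-cycle-cost objectives.

```python
# Hedged sketch only: PPO on a toy sequential-placement task, not the authors' code.
import torch
import torch.nn as nn

P, A = 20, 5  # toy number of core positions and assembly types (assumptions)

class Policy(nn.Module):
    """Maps a one-hot 'which position to fill' observation to assembly-type logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(P, 64), nn.Tanh(), nn.Linear(64, A))

def rollout(policy):
    """Build one loading pattern position by position (no gradients needed here)."""
    obs_list, acts = [], []
    with torch.no_grad():
        for p in range(P):
            obs = torch.zeros(P); obs[p] = 1.0
            dist = torch.distributions.Categorical(logits=policy.net(obs))
            obs_list.append(obs); acts.append(dist.sample())
    return torch.stack(obs_list), torch.stack(acts)

def logprobs(policy, obs, acts):
    return torch.distributions.Categorical(logits=policy.net(obs)).log_prob(acts)

def toy_reward(pattern):
    """Placeholder objective (penalize identical neighbors); a real study would
    instead run a core simulator on the candidate loading pattern."""
    return -sum(int(pattern[i] == pattern[i + 1]) for i in range(P - 1)) / P

policy, clip, baseline = Policy(), 0.2, 0.0
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
for it in range(300):
    obs, acts = rollout(policy)
    R = toy_reward([a.item() for a in acts])     # one scalar reward per episode
    adv = R - baseline                           # advantage vs. a running baseline
    baseline = 0.9 * baseline + 0.1 * R
    old_lp = logprobs(policy, obs, acts).detach()
    for _ in range(4):                           # a few clipped-surrogate epochs
        ratio = torch.exp(logprobs(policy, obs, acts) - old_lp)
        loss = -torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```

The learnable weights are what let the same machinery act globally early on (a high-entropy policy samples diverse patterns) and locally later (a sharpened policy refines a promising region), which is the behavior the abstract attributes to PPO.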
Related papers
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
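For context on the DPO formulation referenced above (Rafailov et al.'s standard objective, stated here from general knowledge rather than from this summary), the policy is trained directly on preference pairs through its log-ratio against a frozen reference model:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses, so no explicit reward model is fit.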
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - The Hitchhiker's Guide to Human Alignment with *PO [43.4130314879284]
We focus on identifying an algorithm that is performant while also remaining robust to varying hyperparameters.
Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality.
Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality.
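As a hedged reading of the length-normalized idea (the paper defines the exact form; this is an assumption for illustration), one natural variant divides each sequence-level log-ratio by the response length $|y|$, removing the incentive to inflate length:

$$\mathcal{L}_{\mathrm{LN\text{-}DPO}}(\theta) = -\,\mathbb{E}\!\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|}\log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$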
arXiv Detail & Related papers (2024-07-21T17:35:20Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
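Schematically (the exact losses and weighting belong to the paper; this only shows the shape of such an objective), the combination described above is a preference loss regularized by the SFT negative log-likelihood of the preferred response:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{pref}}(\theta) + \lambda\,\mathbb{E}_{(x,\,y_w)}\!\left[-\log \pi_\theta(y_w \mid x)\right], \qquad \lambda > 0$$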
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
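The mean-seeking behavior mentioned above is the standard asymmetry of the KL divergence: fitting a policy $\pi_\theta$ to a target $\pi^*$ by minimizing the forward direction spreads probability mass across all of the target's modes, while the reverse direction concentrates on one:

$$\underbrace{\mathrm{KL}(\pi^* \,\|\, \pi_\theta)}_{\text{mean-seeking (mass-covering)}} \qquad \text{vs.} \qquad \underbrace{\mathrm{KL}(\pi_\theta \,\|\, \pi^*)}_{\text{mode-seeking}}$$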
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward functions.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
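In symbols (standard vanilla-PPO notation, not reproduced from the paper), each agent $i$ maximizes its own clipped surrogate over its local policy $\pi_{\theta_i}$ while the joint data fixes the other agents:

$$L_i(\theta_i) = \mathbb{E}\!\left[\min\!\left(\rho_i\,\hat{A}(s, a),\; \mathrm{clip}(\rho_i,\, 1 - \epsilon,\, 1 + \epsilon)\,\hat{A}(s, a)\right)\right], \qquad \rho_i = \frac{\pi_{\theta_i}(a_i \mid s)}{\pi_{\theta_i^{\mathrm{old}}}(a_i \mid s)}$$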
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Sample-Efficient, Exploration-Based Policy Optimisation for Routing
Problems [2.6782615615913348]
This paper presents a new entropy-based reinforcement learning approach.
In addition, we design an off-policy reinforcement learning technique that maximises the expected return.
We show that our model can generalise to various routing problems.
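The summary does not pin down the formulation, but a common entropy-based objective of this kind (stated here as context, not as the paper's definition) augments the return with a policy-entropy bonus:

$$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_t r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$$

where a larger $\alpha$ pushes the agent to keep exploring alternative routes.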
arXiv Detail & Related papers (2022-05-31T09:51:48Z) - Towards Applicable Reinforcement Learning: Improving the Generalization
and Sample Efficiency with Policy Ensemble [43.95417785185457]
It is challenging for reinforcement learning algorithms to succeed in real-world applications such as financial trading and logistics systems.
We propose Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner.
EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods.
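As an illustration only (the class below is a hypothetical sketch, not the EPPO implementation), an end-to-end policy ensemble can act through the average of its members' action distributions, so all heads train jointly against the same surrogate loss:

```python
# Hypothetical sketch of an ensemble policy head; not the EPPO paper's code.
import torch
import torch.nn as nn

class EnsemblePolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, k=4):
        super().__init__()
        # k independent policy heads trained end-to-end
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
            for _ in range(k)
        )
    def forward(self, obs):
        # average the members' action distributions at decision time
        probs = torch.stack([torch.softmax(h(obs), dim=-1) for h in self.heads])
        return torch.distributions.Categorical(probs=probs.mean(dim=0))

policy = EnsemblePolicy(obs_dim=8, n_actions=3)
action = policy(torch.randn(8)).sample()  # act from the averaged distribution
```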
arXiv Detail & Related papers (2022-05-19T02:25:32Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
However, no guarantees exist on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
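For context, the REPS formulation being referenced (Peters et al., 2010; standard form, not quoted from this paper) maximizes expected reward under a relative-entropy trust region on the state-action distribution $p$ relative to the sampling distribution $q$:

$$\max_{p}\; \mathbb{E}_{(s,a)\sim p}\!\left[r(s,a)\right] \quad \text{s.t.} \quad \mathrm{KL}(p \,\|\, q) \le \epsilon$$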
arXiv Detail & Related papers (2021-03-17T16:22:59Z)