Solving Multistage Stochastic Linear Programming via Regularized Linear
Decision Rules: An Application to Hydrothermal Dispatch Planning
- URL: http://arxiv.org/abs/2110.03146v1
- Date: Thu, 7 Oct 2021 02:36:14 GMT
- Title: Solving Multistage Stochastic Linear Programming via Regularized Linear
Decision Rules: An Application to Hydrothermal Dispatch Planning
- Authors: Felipe Nazare and Alexandre Street
- Abstract summary: We propose a novel regularization scheme for linear decision rules (LDR) based on AdaLASSO (adaptive least absolute shrinkage and selection operator).
Experiments show that the overfit threat is non-negligible when using the classical non-regularized LDR to solve MSLP.
For the LHDP problem, our analysis highlights the benefits of the proposed framework over the non-regularized benchmark: model parsimony, lower out-of-sample costs, and improved spot-price profiles.
- Score: 77.34726150561087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The solution of multistage stochastic linear problems (MSLP) represents a
challenge for many applications. Long-term hydrothermal dispatch planning
(LHDP) materializes this challenge in a real-world problem that affects
electricity markets, economies, and natural resources worldwide. No closed-form
solutions are available for MSLP and the definition of non-anticipative
policies with high-quality out-of-sample performance is crucial. Linear
decision rules (LDR) provide an interesting simulation-based framework for
finding high-quality policies for MSLP through two-stage stochastic models. In
practical applications, however, the number of parameters to be estimated when
using an LDR may be close to or higher than the number of scenarios, thereby
generating in-sample overfit and poor performance in out-of-sample
simulations. In this paper, we propose a novel regularization scheme for LDR
based on the AdaLASSO (adaptive least absolute shrinkage and selection
operator). The goal is to use the parsimony principle as largely studied in
high-dimensional linear regression models to obtain better out-of-sample
performance for an LDR applied to MSLP. Computational experiments show that the
overfit threat is non-negligible when using the classical non-regularized LDR
to solve MSLP. For the LHDP problem, our analysis highlights the following
benefits of the proposed framework in comparison to the non-regularized
benchmark: 1) significant reductions in the number of non-zero coefficients
(model parsimony), 2) substantial cost reductions in out-of-sample evaluations,
and 3) improved spot-price profiles.
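For intuition, the sketch below shows what an AdaLASSO-regularized LDR can look like on a toy two-stage problem: decisions are affine in the observed uncertainty, x(xi) = b0 + B xi, fitted over sampled scenarios, with adaptive L1 weights taken from an unregularized first fit. This is a minimal illustration under assumed toy data (the demand model, penalty level lam, and exponent gamma are placeholders), not the paper's LHDP formulation.

```python
# Minimal sketch: AdaLASSO-regularized linear decision rule on a toy two-stage LP.
# The cost/demand data, lam, and gamma below are illustrative assumptions, not the
# paper's hydrothermal dispatch model.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_scen, n_feat, n_dec = 200, 5, 3                   # scenarios, uncertainty features, decisions
xi = rng.normal(size=(n_scen, n_feat))              # sampled uncertainty realizations
c = rng.uniform(1.0, 2.0, size=n_dec)               # per-unit decision cost
d = xi @ rng.uniform(size=(n_feat, n_dec)) + 5.0    # scenario-dependent demand to cover

def fit_ldr(weights, lam):
    """Fit the LDR x(xi) = b0 + B @ xi by minimizing average cost + weighted L1 on B."""
    b0 = cp.Variable(n_dec)
    B = cp.Variable((n_dec, n_feat))
    X = xi @ B.T + np.ones((n_scen, 1)) @ cp.reshape(b0, (1, n_dec))  # decisions per scenario
    shortfall = cp.pos(d - X)                                          # penalized unmet demand
    avg_cost = cp.sum(X @ c) / n_scen + 10.0 * cp.sum(shortfall) / n_scen
    penalty = lam * cp.sum(cp.multiply(weights, cp.abs(B)))
    cp.Problem(cp.Minimize(avg_cost + penalty), [X >= 0]).solve()
    return b0.value, B.value

# Step 1: unregularized fit provides the adaptive weights w_ij = 1 / |B_ij|^gamma.
_, B_hat = fit_ldr(np.ones((n_dec, n_feat)), lam=0.0)
gamma = 1.0
w = 1.0 / (np.abs(B_hat) ** gamma + 1e-6)

# Step 2: AdaLASSO fit drives small coefficients to exact zeros (model parsimony).
_, B_reg = fit_ldr(w, lam=1.0)
print("non-zero LDR coefficients:", int(np.sum(np.abs(B_reg) > 1e-4)))
```

The two-step structure mirrors the adaptive-LASSO idea from high-dimensional regression: the first fit sets coefficient-specific penalty weights, and the second fit prunes weakly supported coefficients, which is the source of the parsimony and out-of-sample gains claimed above.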
Related papers
- Solving Finite-Horizon MDPs via Low-Rank Tensors [9.072279909866845]
We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs).
In finite-horizon MDPs, the policies, and therefore the value functions (VFs), are not stationary.
We propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable.
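As a hedged illustration of the low-rank idea only (not the cited paper's tensor algorithm), the sketch below computes the exact non-stationary value functions of a small random finite-horizon MDP by backward induction and then compresses the stages-by-states value matrix with a truncated SVD; the MDP size, horizon, and rank are assumptions made up for the example.

```python
# Sketch: non-stationary value functions of a random finite-horizon MDP compressed
# to low rank. The MDP, horizon, and rank are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, horizon = 40, 3, 25
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] transition probs
R = rng.uniform(size=(n_states, n_actions))                        # rewards R[s, a]

# Backward induction gives the exact stage-dependent value functions V[t, s].
V = np.zeros((horizon + 1, n_states))
for t in range(horizon - 1, -1, -1):
    Q = R + P @ V[t + 1]          # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[t+1, s']
    V[t] = Q.max(axis=1)

# Truncated SVD keeps a few temporal and spatial factors instead of the full table.
U, S, Vt = np.linalg.svd(V[:horizon], full_matrices=False)
rank = 3
V_lowrank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
err = np.abs(V_lowrank - V[:horizon]).max()
print(f"parameters: {horizon * n_states} -> {rank * (horizon + n_states)}, max abs error: {err:.3f}")
```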
arXiv Detail & Related papers (2025-01-17T23:10:50Z)
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
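A toy, hedged rendering of that idea (not the paper's algorithm or its guarantees): perturb the policy parameters in a random direction, query a preference between the two perturbed policies as a stand-in for the local value difference, and step along the signed direction. The preference oracle, step sizes, and dimensions below are invented for illustration.

```python
# Sketch: preference-driven zeroth-order policy update. The oracle and constants
# are illustrative assumptions, not the cited algorithm.
import numpy as np

rng = np.random.default_rng(1)
dim = 8                                  # policy parameter dimension
theta = np.zeros(dim)                    # current policy parameters

def preference_oracle(theta_a, theta_b):
    """Return +1 if the (hidden) evaluator prefers policy a over policy b, else -1."""
    hidden_optimum = np.ones(dim)        # stand-in for the evaluator's preference structure
    dist_a = np.linalg.norm(theta_a - hidden_optimum)
    dist_b = np.linalg.norm(theta_b - hidden_optimum)
    return 1.0 if dist_a < dist_b else -1.0

mu, lr, iters = 0.1, 0.05, 500           # perturbation scale, step size, iterations
for _ in range(iters):
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)               # random unit direction
    # The preference between the two perturbed policies plays the role of a local
    # value-function difference and gives the sign of the directional derivative.
    g_hat = preference_oracle(theta + mu * u, theta - mu * u) * u
    theta += lr * g_hat                  # ascend the estimated gradient direction

print("distance to hidden optimum:", round(float(np.linalg.norm(theta - np.ones(dim))), 3))
```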
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
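To make the "preference loss plus supervised loss" shape concrete, here is a minimal numpy sketch of such a combined objective; the logistic pairwise term, the mixing weight beta, and the toy log-probabilities are assumptions for illustration, not the paper's exact objective.

```python
# Sketch: pairwise preference loss mixed with a supervised (SFT-style) term.
# beta and the toy log-probabilities are illustrative assumptions.
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)        # numerically stable log(sigmoid(x))

def combined_loss(chosen_logp, rejected_logp, sft_logp, beta=0.1):
    """Preference term pushes chosen above rejected; the SFT term anchors the policy."""
    preference_loss = -log_sigmoid(chosen_logp - rejected_logp).mean()
    sft_loss = -sft_logp.mean()
    return preference_loss + beta * sft_loss

rng = np.random.default_rng(2)
chosen = rng.normal(0.5, 1.0, size=32)       # log-probabilities of preferred responses
rejected = rng.normal(-0.5, 1.0, size=32)    # log-probabilities of rejected responses
print("combined loss:", round(float(combined_loss(chosen, rejected, sft_logp=chosen)), 3))
```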
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- A Moreau Envelope Approach for LQR Meta-Policy Estimation [0.7311194870168775]
We study the problem of policy estimation for the Linear Quadratic Regulator (LQR) in discrete-time linear time-invariant uncertain dynamical systems.
We propose a surrogate LQR cost, built from a finite set of realizations of the uncertain system, to define a meta-policy efficiently adjustable to new realizations.
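A rough sketch of the surrogate-cost idea, under toy assumptions (scalar dynamics, a grid search over static gains, and made-up sampled systems) rather than the paper's Moreau-envelope construction: average the finite-horizon LQR cost of a candidate gain over the sampled realizations and pick the gain minimizing that average.

```python
# Sketch: meta-gain chosen by minimizing an LQR cost averaged over sampled systems.
# Dynamics, horizon, and the candidate-gain grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
horizon, n_sys = 50, 10
systems = [(1.0 + 0.1 * rng.normal(), 1.0 + 0.1 * rng.normal()) for _ in range(n_sys)]

def lqr_cost(k, a, b, x0=1.0, q=1.0, r=0.1):
    """Finite-horizon quadratic cost of u_t = -k * x_t on the scalar system x_{t+1} = a x_t + b u_t."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += q * x**2 + r * u**2
        x = a * x + b * u
    return cost

def surrogate_cost(k):
    """Average per-realization cost: the objective defining the meta-policy gain."""
    return float(np.mean([lqr_cost(k, a, b) for a, b in systems]))

gains = np.linspace(0.0, 1.5, 151)
k_meta = gains[int(np.argmin([surrogate_cost(k) for k in gains]))]
print("meta-policy gain:", round(float(k_meta), 3))
```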
arXiv Detail & Related papers (2024-03-26T04:02:09Z)
- Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs [16.006893624836554]
We propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE).
VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step.
In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct.
arXiv Detail & Related papers (2023-10-17T18:27:27Z)
- Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
arXiv Detail & Related papers (2023-08-10T09:52:44Z)
- PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that PARL can address the alignment concerns in RL, showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We provide a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.