Model-Free $\mu$-Synthesis: A Nonsmooth Optimization Perspective
- URL: http://arxiv.org/abs/2402.11654v1
- Date: Sun, 18 Feb 2024 17:17:17 GMT
- Title: Model-Free $\mu$-Synthesis: A Nonsmooth Optimization Perspective
- Authors: Darioush Keivan, Xingang Guo, Peter Seiler, Geir Dullerud, Bin Hu
- Abstract summary: In this paper, we revisit an important policy search benchmark, namely $\mu$-synthesis.
We extend subgradient-based search methods, which have led to impressive numerical results in practice, to the model-free setting.
- Score: 4.477225073240389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit model-free policy search on an important robust
control benchmark, namely $\mu$-synthesis. In the general output-feedback
setting, there do not exist convex formulations for this problem, and hence
global optimality guarantees are not expected. Apkarian (2011) presented a
nonconvex nonsmooth policy optimization approach for this problem, and achieved
state-of-the-art design results using subgradient-based policy search
algorithms which generate update directions in a model-based manner. Despite
the lack of convexity and global optimality guarantees, these subgradient-based
policy search methods have led to impressive numerical results in practice.
Building on this policy optimization perspective, our paper extends these
subgradient-based search methods to a model-free setting. Specifically, we
examine the effectiveness of two model-free policy optimization strategies: the
model-free non-derivative sampling method and the zeroth-order policy search
with uniform smoothing. We performed an extensive numerical study to
demonstrate that both methods consistently replicate the design outcomes
achieved by their model-based counterparts. Additionally, we provide some
theoretical justifications showing that convergence guarantees to stationary
points can be established for our model-free $\mu$-synthesis under some
assumptions related to the coerciveness of the cost function. Overall, our
results demonstrate that derivative-free policy optimization offers a
competitive and viable approach for solving general output-feedback
$\mu$-synthesis problems in the model-free setting.
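As a concrete illustration (not taken from the paper itself), the second strategy named in the abstract, zeroth-order policy search with uniform smoothing, can be sketched as below. The cost oracle `evaluate_cost`, the step size, the smoothing radius, and the toy quadratic objective are all placeholder assumptions; in the actual $\mu$-synthesis setting the closed-loop robust performance cost would be estimated from simulations or experiments rather than from a plant model.

```python
# Minimal sketch of zeroth-order policy search with uniform smoothing.
# Everything here is a stand-in assumption, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)


def evaluate_cost(K):
    """Stand-in cost oracle: a noisy quadratic used only to make the sketch
    runnable.  Replace with a data-driven estimate of the closed-loop cost."""
    return float(np.sum(K ** 2) + 0.01 * rng.standard_normal())


def zeroth_order_search(K0, delta=1e-2, step_size=1e-3, iters=500):
    """Two-point zeroth-order update with uniform smoothing: sample a random
    direction U on the unit sphere and form
        g = d / (2 * delta) * (J(K + delta U) - J(K - delta U)) * U,
    an estimate of the gradient of the uniformly smoothed cost."""
    K = np.array(K0, dtype=float)
    d = K.size
    for _ in range(iters):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # uniform direction on the unit sphere
        diff = evaluate_cost(K + delta * U) - evaluate_cost(K - delta * U)
        K -= step_size * (d / (2.0 * delta)) * diff * U
    return K


if __name__ == "__main__":
    K_init = rng.standard_normal((2, 3))  # hypothetical controller parameters
    print(zeroth_order_search(K_init))
```

The two-point difference along a random sphere direction estimates the gradient of the smoothed surrogate cost, which is what enables stationarity-type convergence arguments of the kind the abstract mentions under coerciveness assumptions on the cost.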
Related papers
- Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [61.580419063416734]
A recent stream of structured learning approaches has improved the practical state of the art for a range of optimization problems.
The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately.
In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and improves the generalization error.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
- Model-Free Active Exploration in Reinforcement Learning [53.786439742572995]
We study the problem of exploration in Reinforcement Learning and present a novel model-free solution.
Our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.
arXiv Detail & Related papers (2024-06-30T19:00:49Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^{3}})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Policy Gradient Method For Robust Reinforcement Learning [23.62008807533706]
This paper develops the first policy gradient method with global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch.
We show that the proposed robust policy gradient method converges to the global optimum under direct policy parameterization.
We then extend our methodology to the general model-free setting and design a robust actor-critic method with a parametric policy class and value function approximation.
arXiv Detail & Related papers (2022-05-15T17:35:17Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)