Connected Superlevel Set in (Deep) Reinforcement Learning and its
Application to Minimax Theorems
- URL: http://arxiv.org/abs/2303.12981v3
- Date: Sat, 30 Sep 2023 17:58:43 GMT
- Title: Connected Superlevel Set in (Deep) Reinforcement Learning and its
Application to Minimax Theorems
- Authors: Sihan Zeng, Thinh T. Doan, Justin Romberg
- Abstract summary: We show that the superlevel set of the objective function with respect to the policy parameter is always a connected set.
We show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property.
This is the first time such a result has been established in the literature.
- Score: 15.632127097145881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of this paper is to improve the understanding of the optimization
landscape for policy optimization problems in reinforcement learning.
Specifically, we show that the superlevel set of the objective function with
respect to the policy parameter is always a connected set both in the tabular
setting and under policies represented by a class of neural networks. In
addition, we show that the optimization objective as a function of the policy
parameter and reward satisfies a stronger "equiconnectedness" property. To the
best of our knowledge, these are new, previously unestablished results.
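As a concrete reading of this claim (the notation here is ours, not taken from the paper): writing $J(\theta)$ for the expected return under policy parameter $\theta$, the statement is that every superlevel set

$$
\mathcal{U}_\lambda \;=\; \{\, \theta \;:\; J(\theta) \ge \lambda \,\}, \qquad \lambda \in \mathbb{R},
$$

is connected, both for tabular policies and for the stated class of neural network parameterizations; "equiconnectedness" is, roughly, the same property for $J(\theta, r)$ holding uniformly over the reward $r$.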
We present an application of the connectedness of these superlevel sets to
the derivation of minimax theorems for robust reinforcement learning. We show
that any minimax optimization program that is convex on one side and
equiconnected on the other side satisfies the minimax equality (i.e., has a Nash
equilibrium). We find that this exact structure is exhibited by an interesting
robust reinforcement learning problem under an adversarial reward attack, and
the validity of its minimax equality immediately follows. This is the first
time such a result has been established in the literature.
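In the notation above (again ours, and only a sketch of the abstract's claim, not the paper's exact statement): if $J(\theta, r)$ is convex in the adversarial reward $r$ over a convex set $\mathcal{R}$ and equiconnected in the policy parameter $\theta$, then

$$
\min_{r \in \mathcal{R}} \; \max_{\theta} \; J(\theta, r) \;=\; \max_{\theta} \; \min_{r \in \mathcal{R}} \; J(\theta, r),
$$

i.e., the adversarial-reward robust RL problem admits a Nash equilibrium.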
Related papers
- A Novel Variational Lower Bound for Inverse Reinforcement Learning [5.370126167091961]
Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories.
We present a new Variational Lower Bound for IRL (VLB-IRL).
Our method simultaneously learns the reward function and policy under the learned reward function.
arXiv Detail & Related papers (2023-11-07T03:50:43Z)
- Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z)
- PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper-level alignment objective (reward design) by the lower-level optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL, showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
- Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
- Topological Guided Actor-Critic Modular Learning of Continuous Systems with Temporal Objectives [2.398608007786179]
This work investigates the formal policy synthesis of continuous-state dynamic systems given high-level specifications in linear temporal logic.
We use neural networks to approximate the value function and policy function for the hybrid product state space.
arXiv Detail & Related papers (2023-04-20T01:36:05Z)
- Unsupervised Learning for Combinatorial Optimization with Principled Objective Relaxation [19.582494782591386]
This work proposes an unsupervised learning framework for combinatorial optimization (CO) problems.
Our key contribution is the observation that if the relaxed objective satisfies entry-wise concavity, a low optimization loss guarantees the quality of the final integral solutions.
In particular, this observation can guide the design of objective models in applications where the objectives are not given explicitly but must be modeled beforehand.
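Roughly, the mechanism behind this guarantee (stated in our notation, as we read the claim, not as the paper states it): if the relaxation $f_r : [0,1]^n \to \mathbb{R}$ agrees with the discrete objective $f$ on $\{0,1\}^n$ and is concave in each coordinate separately, then, since a concave function of one coordinate attains its minimum at an endpoint, rounding a fractional solution $\bar{x}$ one coordinate at a time never increases the relaxed objective:

$$
\exists\, x \in \{0,1\}^n \ \text{(obtained by entry-wise rounding of } \bar{x}\text{)}: \quad f(x) = f_r(x) \;\le\; f_r(\bar{x}).
$$

Hence a low relaxed optimization loss certifies a comparably good integral solution.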
arXiv Detail & Related papers (2022-07-13T06:44:17Z)
- Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer [7.970144204429356]
We introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set.
We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks.
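For background on how such a combination typically works (this is the standard successor-features / generalized policy improvement construction from the literature, not a detail taken from this abstract): given policies $\pi_1, \dots, \pi_n$ with successor features $\psi^{\pi_i}$, a new task whose reward is linearly expressible as $r(s,a) = \phi(s,a)^\top \mathbf{w}$ is handled by acting greedily over the reconstructed value functions,

$$
Q^{\pi_i}_{\mathbf{w}}(s,a) = \psi^{\pi_i}(s,a)^\top \mathbf{w}, \qquad \pi^{\mathrm{GPI}}(s) \in \arg\max_{a} \, \max_{i} \, Q^{\pi_i}_{\mathbf{w}}(s,a).
$$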
arXiv Detail & Related papers (2022-06-22T19:00:08Z)
- Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach [65.27783264330711]
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity.
We devise algorithms learning optimal tilt control policies from existing data.
We show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
arXiv Detail & Related papers (2022-01-06T18:24:30Z)
- Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates [63.58053355357644]
We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks.
We show theoretically that having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance.
arXiv Detail & Related papers (2021-12-30T12:20:46Z)
- Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency [83.02999769628593]
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning.
We show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions.
We present the first finite-sample result with first-order efficiency in non-tabular environments.
arXiv Detail & Related papers (2021-02-05T03:20:39Z)
- The Landscape of the Proximal Point Method for Nonconvex-Nonconcave Minimax Optimization [10.112779201155005]
Minimax PPM has become a central tool in machine learning, with applications in robust optimization, reinforcement learning, GANs, etc.
These applications are often nonconvex-nonconcave, but existing theory is unable to identify and deal with the fundamental difficulties this poses.
arXiv Detail & Related papers (2020-06-15T18:17:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.