Sample-efficient Iterative Lower Bound Optimization of Deep Reactive
Policies for Planning in Continuous MDPs
- URL: http://arxiv.org/abs/2203.12679v1
- Date: Wed, 23 Mar 2022 19:06:16 GMT
- Title: Sample-efficient Iterative Lower Bound Optimization of Deep Reactive
Policies for Planning in Continuous MDPs
- Authors: Siow Meng Low, Akshat Kumar, Scott Sanner
- Abstract summary: In this work, we take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective.
This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because each step is structurally easier to optimize than the overall objective, the objective improves monotonically under certain conditions, and samples are reused between iterations.
Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner.
- Score: 27.41101006357176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning have enabled optimization of deep reactive
policies (DRPs) for continuous MDP planning by encoding a parametric policy as
a deep neural network and exploiting automatic differentiation in an end-to-end
model-based gradient descent framework. This approach has proven effective for
optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of
sampled trajectories to learn effectively and can suffer from high variance in
solution quality. In this work, we revisit the overall model-based DRP
objective and instead take a minorization-maximization perspective to
iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective.
This novel formulation of DRP learning as iterative lower bound optimization
(ILBO) is particularly appealing because (i) each step is structurally easier
to optimize than the overall objective, (ii) it guarantees a monotonically
improving objective under certain theoretical conditions, and (iii) it reuses
samples between iterations thus lowering sample complexity. Empirical
evaluation confirms that ILBO is significantly more sample-efficient than the
state-of-the-art DRP planner and consistently produces better solution quality
with lower variance. We additionally demonstrate that ILBO generalizes well to
new problem instances (i.e., different initial states) without requiring
retraining.
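To make the baseline concrete, the sketch below illustrates the end-to-end model-based gradient-descent setup described above: a neural-network policy is rolled out through a differentiable transition and reward model, and the cumulative reward is differentiated w.r.t. the policy parameters via automatic differentiation. This is a minimal illustration only; the dynamics, reward, network sizes, and learning rate are assumptions made for the example and are not taken from the paper.

```python
# Minimal sketch (not the authors' code) of end-to-end model-based DRP training:
# a policy network is rolled out through a known differentiable transition model,
# and the cumulative reward is maximized by gradient ascent on the policy weights.
import jax
import jax.numpy as jnp

def init_policy(key, state_dim=2, hidden=32, action_dim=1):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (state_dim, hidden)) * 0.1,
        "b1": jnp.zeros(hidden),
        "W2": jax.random.normal(k2, (hidden, action_dim)) * 0.1,
        "b2": jnp.zeros(action_dim),
    }

def policy(params, s):
    h = jnp.tanh(s @ params["W1"] + params["b1"])
    return jnp.tanh(h @ params["W2"] + params["b2"])  # bounded continuous action

def transition(s, a):
    # Illustrative nonlinear continuous dynamics (assumed, not from the paper).
    return s + 0.1 * jnp.concatenate([s[1:] ** 2, a])

def reward(s, a):
    # Illustrative reward: stay near the origin while using small actions.
    return -jnp.sum(s ** 2) - 0.01 * jnp.sum(a ** 2)

def rollout_return(params, s0, horizon=20):
    def step(s, _):
        a = policy(params, s)
        return transition(s, a), reward(s, a)
    _, rs = jax.lax.scan(step, s0, None, length=horizon)
    return jnp.sum(rs)

@jax.jit
def update(params, s0, lr=1e-2):
    # Gradient *ascent* on the return, backpropagated through the model rollout.
    grads = jax.grad(rollout_return)(params, s0)
    return jax.tree_util.tree_map(lambda p, g: p + lr * g, params, grads)

params = init_policy(jax.random.PRNGKey(0))
s0 = jnp.array([1.0, 0.0])
for _ in range(200):
    params = update(params, s0)
```

In this direct formulation every gradient step re-simulates fresh trajectories, which is precisely the sample cost that ILBO's reuse of samples between iterations is intended to reduce.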
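The minorization-maximization perspective referenced in the abstract follows the standard MM template: construct a surrogate that lower-bounds the true objective and is tight at the current iterate, then maximize the surrogate. The notation below is generic rather than the paper's, but it shows why each such step yields a monotonically non-decreasing objective:

```latex
% Generic MM template; J(\theta) is the planning objective,
% g(\theta \mid \theta_t) a surrogate tight at the current iterate \theta_t.
\begin{align}
g(\theta \mid \theta_t) &\le J(\theta) \quad \text{for all } \theta,
\qquad g(\theta_t \mid \theta_t) = J(\theta_t), \\
\theta_{t+1} &= \arg\max_{\theta}\, g(\theta \mid \theta_t)
\;\;\Longrightarrow\;\;
J(\theta_{t+1}) \ge g(\theta_{t+1} \mid \theta_t) \ge g(\theta_t \mid \theta_t) = J(\theta_t).
\end{align}
```

Because the surrogate only needs to be accurate locally around $\theta_t$, each inner maximization is structurally simpler than optimizing $J(\theta)$ directly, which is the property the abstract highlights.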
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z)
- RoPINN: Region Optimized Physics-Informed Neural Networks [66.38369833561039]
Physics-informed neural networks (PINNs) have been widely applied to solve partial differential equations (PDEs).
This paper proposes and theoretically studies a new training paradigm called region optimization.
A practical training algorithm, Region Optimized PINN (RoPINN), is seamlessly derived from this new paradigm.
arXiv Detail & Related papers (2024-05-23T09:45:57Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Edge-Efficient Deep Learning Models for Automatic Modulation Classification: A Performance Analysis [0.7428236410246183]
We investigate optimized convolutional neural networks (CNNs) developed for automatic modulation classification (AMC) of wireless signals.
We propose optimized models that combine these techniques to fuse their complementary optimization benefits.
The experimental results show that the proposed individual and combined optimization techniques are highly effective for developing models with significantly less complexity.
arXiv Detail & Related papers (2024-04-11T06:08:23Z) - Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms [13.134564730161983]
This paper adopts a novel approach to deep learning optimization, focusing on stochastic gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minima optimizers such as SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
arXiv Detail & Related papers (2024-03-01T14:55:22Z) - Regularization and Variance-Weighted Regression Achieves Minimax
Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL).
We study MDVI with linear function approximation through its sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
arXiv Detail & Related papers (2023-05-22T16:13:05Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards
Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape [15.362190838843915]
We show that LPF-SGD converges to a better optimal point with smaller generalization error than SGD.
We show that our algorithm achieves superior generalization performance compared to the common DL training strategies.
arXiv Detail & Related papers (2022-01-20T07:13:04Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)