Programming by Rewards
- URL: http://arxiv.org/abs/2007.06835v1
- Date: Tue, 14 Jul 2020 05:49:14 GMT
- Title: Programming by Rewards
- Authors: Nagarajan Natarajan, Ajaykrishna Karthikeyan, Prateek Jain, Ivan
Radicek, Sriram Rajamani, Sumit Gulwani, Johannes Gehrke
- Abstract summary: We formalize and study ``programming by rewards'' (PBR), a new approach for specifying and synthesizing subroutines that optimize a quantitative metric such as performance, resource utilization, or correctness over a benchmark.
A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run) that assigns a reward for each execution.
We show that the framework is theoretically founded: when the rewards satisfy certain properties, the synthesized code is optimal in a precise sense.
- Score: 20.626369097817715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We formalize and study ``programming by rewards'' (PBR), a new approach for
specifying and synthesizing subroutines for optimizing some quantitative metric
such as performance, resource utilization, or correctness over a benchmark. A
PBR specification consists of (1) input features $x$, and (2) a reward function
$r$, modeled as a black-box component (which we can only run), that assigns a
reward for each execution. The goal of the synthesizer is to synthesize a
"decision function" $f$ which transforms the features to a decision value for
the black-box component so as to maximize the expected reward
$E[r \circ f(x)]$ for executing decisions $f(x)$ for various values of $x$.
We consider a
space of decision functions in a DSL of loop-free if-then-else programs, which
can branch on linear functions of the input features in a tree-structure and
compute a linear function of the inputs in the leaves of the tree. We find that
this DSL captures decision functions that are manually written in practice by
programmers. Our technical contribution is the use of continuous-optimization
techniques to perform synthesis of such decision functions as if-then-else
programs. We also show that the framework is theoretically founded: in cases
when the rewards satisfy certain properties, the synthesized code is optimal in
a precise sense.
We have leveraged PBR to synthesize non-trivial decision functions related to
search and ranking heuristics in the PROSE codebase (an industrial-strength
program synthesis framework) and achieve results competitive with manually
written procedures tuned over multiple man-years. We present an empirical
evaluation against other baseline techniques on real-world case studies
(including PROSE) as well as on simple synthetic benchmarks.
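To make the setup concrete, here is a minimal sketch of the PBR ingredients: a decision function shaped like the paper's DSL (a loop-free if-then-else tree that branches on a linear predicate over the features and returns a linear function of the features at each leaf), a black-box reward we can only execute, and a simple gradient-free tuner that picks parameters maximizing the average reward over a benchmark. All names, the toy reward, and the random-search loop are illustrative assumptions, not the paper's implementation; the paper instead synthesizes such trees with continuous-optimization techniques.

```python
import numpy as np

def decision_function(x, params):
    """Depth-1 if-then-else tree in the DSL's shape: branch on a linear
    predicate over features x, return a linear function of x at the leaf."""
    w, b, u1, c1, u2, c2 = params
    if np.dot(w, x) + b >= 0:          # linear branch condition
        return np.dot(u1, x) + c1      # linear leaf 1
    return np.dot(u2, x) + c2          # linear leaf 2

def black_box_reward(x, decision):
    """Stand-in for the opaque reward component r: we can only run it.
    (A real PBR instance would execute the system and report its metric;
    this toy target is purely for illustration.)"""
    return -abs(decision - (2.0 * x[0] - x[1]))

def expected_reward(params, xs):
    """Monte-Carlo estimate of E[r(f(x))] over the benchmark inputs xs."""
    return np.mean([black_box_reward(x, decision_function(x, params)) for x in xs])

# Random search as a stand-in for the paper's continuous-optimization-based
# synthesis of the tree parameters.
rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(200)]   # benchmark of input features
d = 2
best_params, best_val = None, -np.inf
for _ in range(3000):
    cand = (rng.normal(size=d), rng.normal(), rng.normal(size=d), rng.normal(),
            rng.normal(size=d), rng.normal())
    val = expected_reward(cand, xs)
    if val > best_val:
        best_params, best_val = cand, val

print(f"best expected reward: {best_val:.4f}")
```

Deeper decision functions correspond to nesting the same if-then-else pattern; the actual synthesizer explores this parameter space with continuous optimization rather than random sampling.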
Related papers
- Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks [4.1764890353794994]
NEON is an architecture for generating predictions with uncertainty using a single operator network backbone.
We show that NEON achieves state-of-the-art performance while requiring orders of magnitude less trainable parameters.
arXiv Detail & Related papers (2024-04-03T22:42:37Z) - BOIS: Bayesian Optimization of Interconnected Systems [0.0]
We introduce a new paradigm which allows for the efficient use of composite functions in BO.
We show that this simple approach (which we call BOIS) enables the exploitation of structural knowledge.
Our results indicate that BOIS achieves performance gains and accurately captures the statistics of composite functions.
arXiv Detail & Related papers (2023-11-19T06:44:13Z) - Bayesian Optimization for Function Compositions with Applications to
Dynamic Pricing [0.0]
We propose a practical BO method of function compositions where the form of the composition is known but the constituent functions are expensive to evaluate.
We demonstrate a novel application to dynamic pricing in revenue management when the underlying demand function is expensive to evaluate.
arXiv Detail & Related papers (2023-03-21T15:45:06Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Efficient computation of the Knowledge Gradient for Bayesian
Optimization [1.0497128347190048]
One-shot Hybrid KG is a new approach that combines several of the previously proposed ideas and is cheap to compute as well as powerful and efficient.
All experiments are implemented in BOTorch and show empirically drastically reduced computational overhead with equal or improved performance.
arXiv Detail & Related papers (2022-09-30T10:39:38Z) - Submodular + Concave [53.208470310734825]
It has been well established that first order optimization methods can converge to the maximal objective value of concave functions.
In this work, we initiate the study of maximizing smooth functions of the form $F(x) = G(x) + C(x)$ over a convex body.
This class of functions is an extension of both concave and continuous DR-submodular functions for which no guarantee is known.
arXiv Detail & Related papers (2021-06-09T01:59:55Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound of order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Finding Global Minima via Kernel Approximations [90.42048080064849]
We consider the global minimization of smooth functions based solely on function evaluations.
In this paper, we consider an approach that jointly models the function to approximate and finds a global minimum.
arXiv Detail & Related papers (2020-12-22T12:59:30Z) - Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
arXiv Detail & Related papers (2020-10-06T00:10:16Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)