Programming by Rewards
- URL: http://arxiv.org/abs/2007.06835v1
- Date: Tue, 14 Jul 2020 05:49:14 GMT
- Title: Programming by Rewards
- Authors: Nagarajan Natarajan, Ajaykrishna Karthikeyan, Prateek Jain, Ivan
Radicek, Sriram Rajamani, Sumit Gulwani, Johannes Gehrke
- Abstract summary: We formalize and study ``programming by rewards'' (PBR), a new approach for specifying and synthesizing subroutines that optimize a quantitative metric such as performance, resource utilization, or correctness over a benchmark.
A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run) that assigns a reward for each execution.
We show that the framework is theoretically founded: when the rewards satisfy certain properties, the synthesized code is optimal in a precise sense.
- Score: 20.626369097817715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We formalize and study ``programming by rewards'' (PBR), a new approach for
specifying and synthesizing subroutines for optimizing some quantitative metric
such as performance, resource utilization, or correctness over a benchmark. A
PBR specification consists of (1) input features $x$, and (2) a reward function
$r$, modeled as a black-box component (which we can only run), that assigns a
reward for each execution. The goal of the synthesizer is to synthesize a
"decision function" $f$ which transforms the features to a decision value for
the black-box component so as to maximize the expected reward
$E[r \circ f(x)]$ for executing decisions $f(x)$ for various values of $x$.
We consider a
space of decision functions in a DSL of loop-free if-then-else programs, which
can branch on linear functions of the input features in a tree-structure and
compute a linear function of the inputs in the leaves of the tree. We find that
this DSL captures decision functions that are manually written in practice by
programmers. Our technical contribution is the use of continuous-optimization
techniques to perform synthesis of such decision functions as if-then-else
programs. We also show that the framework is theoretically founded: in cases
when the rewards satisfy certain properties, the synthesized code is optimal in
a precise sense.
We have leveraged PBR to synthesize non-trivial decision functions related to
search and ranking heuristics in the PROSE codebase (an industrial-strength
program synthesis framework) and achieve results competitive with manually
written procedures tuned over multiple man-years. We present an empirical
evaluation against other baseline techniques on real-world case studies
(including PROSE) as well as on simple synthetic benchmarks.
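To make the setup concrete, here is a minimal sketch of the PBR ingredients: a decision function shaped like the paper's DSL (a loop-free if-then-else tree that branches on a linear predicate over the features and returns a linear function of the features at each leaf), a black-box reward we can only execute, and a simple gradient-free tuner that picks parameters maximizing the average reward over a benchmark. All names, the toy reward, and the random-search loop are illustrative assumptions, not the paper's implementation; the paper instead synthesizes such trees with continuous-optimization techniques.

```python
import numpy as np

def decision_function(x, params):
    """Depth-1 if-then-else tree in the DSL's shape: branch on a linear
    predicate over features x, return a linear function of x at the leaf."""
    w, b, u1, c1, u2, c2 = params
    if np.dot(w, x) + b >= 0:          # linear branch condition
        return np.dot(u1, x) + c1      # linear leaf 1
    return np.dot(u2, x) + c2          # linear leaf 2

def black_box_reward(x, decision):
    """Stand-in for the opaque reward component r: we can only run it.
    (A real PBR instance would execute the system and report its metric;
    this toy target is purely for illustration.)"""
    return -abs(decision - (2.0 * x[0] - x[1]))

def expected_reward(params, xs):
    """Monte-Carlo estimate of E[r(f(x))] over the benchmark inputs xs."""
    return np.mean([black_box_reward(x, decision_function(x, params)) for x in xs])

# Random search as a stand-in for the paper's continuous-optimization-based
# synthesis of the tree parameters.
rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(200)]   # benchmark of input features
d = 2
best_params, best_val = None, -np.inf
for _ in range(3000):
    cand = (rng.normal(size=d), rng.normal(), rng.normal(size=d), rng.normal(),
            rng.normal(size=d), rng.normal())
    val = expected_reward(cand, xs)
    if val > best_val:
        best_params, best_val = cand, val

print(f"best expected reward: {best_val:.4f}")
```

Deeper decision functions correspond to nesting the same if-then-else pattern; the actual synthesizer explores this parameter space with continuous optimization rather than random sampling.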
Related papers
- Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks [4.1764890353794994]
NEON is an architecture for generating predictions with uncertainty using a single operator network backbone.
We show that NEON achieves state-of-the-art performance while requiring orders of magnitude less trainable parameters.
arXiv Detail & Related papers (2024-04-03T22:42:37Z) - BOIS: Bayesian Optimization of Interconnected Systems [0.0]
We introduce a new paradigm which allows for the efficient use of composite functions in BO.
We show that this simple approach (which we call BOIS) enables the exploitation of structural knowledge.
Our results indicate that BOIS achieves performance gains and accurately captures the statistics of composite functions.
arXiv Detail & Related papers (2023-11-19T06:44:13Z) - Bayesian Optimization for Function Compositions with Applications to
Dynamic Pricing [0.0]
We propose a practical BO method of function compositions where the form of the composition is known but the constituent functions are expensive to evaluate.
We demonstrate a novel application to dynamic pricing in revenue management when the underlying demand function is expensive to evaluate.
arXiv Detail & Related papers (2023-03-21T15:45:06Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Efficient computation of the Knowledge Gradient for Bayesian
Optimization [1.0497128347190048]
One-shot Hybrid KG is a new approach that combines several of the previously proposed ideas and is cheap to compute as well as powerful and efficient.
All experiments are implemented in BOTorch and show empirically drastically reduced computational overhead with equal or improved performance.
arXiv Detail & Related papers (2022-09-30T10:39:38Z) - Submodular + Concave [53.208470310734825]
It has been well established that first order optimization methods can converge to the maximal objective value of concave functions.
In this work, we initiate the study of maximizing smooth functions of the form $F(x) = G(x) + C(x)$ over a convex body.
This class of functions is an extension of both concave and continuous DR-submodular functions for which no guarantee is known.
arXiv Detail & Related papers (2021-06-09T01:59:55Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound of order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Finding Global Minima via Kernel Approximations [90.42048080064849]
We consider the global minimization of smooth functions based solely on function evaluations.
In this paper, we consider an approach that jointly models the function to approximate and finds a global minimum.
arXiv Detail & Related papers (2020-12-22T12:59:30Z) - Reward Machines: Exploiting Reward Function Structure in Reinforcement
Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
arXiv Detail & Related papers (2020-10-06T00:10:16Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)