The Sample Complexity of Online Contract Design
- URL: http://arxiv.org/abs/2211.05732v3
- Date: Fri, 19 May 2023 23:18:55 GMT
- Title: The Sample Complexity of Online Contract Design
- Authors: Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao,
and Michael I. Jordan
- Abstract summary: We study the hidden-action principal-agent problem in an online setting.
In each round, the principal posts a contract that specifies the payment to the agent based on each outcome.
The agent then makes a strategic choice of action that maximizes her own utility, but the action is not directly observable by the principal.
- Score: 120.9833763323407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the hidden-action principal-agent problem in an online setting. In
each round, the principal posts a contract that specifies the payment to the
agent based on each outcome. The agent then makes a strategic choice of action
that maximizes her own utility, but the action is not directly observable by
the principal. The principal observes the outcome and receives utility from the
agent's choice of action. Based on past observations, the principal dynamically
adjusts the contracts with the goal of maximizing her utility.
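As a point of reference for this protocol, here is a minimal sketch of one round under the standard hidden-action model, in which each action induces a distribution over the $m$ outcomes and a private cost for the agent. All instance data below (`F`, `costs`, `rewards`, `contract`) are hypothetical and chosen only for illustration:

```python
import numpy as np

def agent_best_response(contract, F, costs):
    """Agent picks the action maximizing expected payment minus cost."""
    # F[a] is the outcome distribution of action a, contract[o] the payment
    # for outcome o, costs[a] the agent's private cost of action a.
    return int(np.argmax(F @ contract - costs))

def play_round(contract, F, costs, rewards, rng):
    """One round: the agent best-responds; the principal sees only the outcome."""
    action = agent_best_response(contract, F, costs)   # hidden from the principal
    outcome = rng.choice(len(contract), p=F[action])   # observed by the principal
    return outcome, rewards[outcome] - contract[outcome]

# Hypothetical instance: 2 actions, m = 3 outcomes.
rng = np.random.default_rng(0)
F = np.array([[0.8, 0.1, 0.1],    # low-effort action
              [0.2, 0.3, 0.5]])   # high-effort action
costs = np.array([0.0, 0.2])
rewards = np.array([0.0, 0.5, 1.0])
contract = np.array([0.0, 0.1, 0.4])
print(play_round(contract, F, costs, rewards, rng))  # (outcome, principal utility)
```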
We introduce an online learning algorithm and provide an upper bound on its
Stackelberg regret. We show that when the contract space is $[0,1]^m$, the
Stackelberg regret is upper bounded by $\widetilde O(\sqrt{m} \cdot
T^{1-1/(2m+1)})$, and lower bounded by $\Omega(T^{1-1/(m+2)})$, where
$\widetilde O$ omits logarithmic factors. This result shows that
exponential-in-$m$ samples are sufficient and necessary to learn a near-optimal
contract, resolving an open problem on the hardness of online contract design.
Moreover, when contracts are restricted to some subset $\mathcal{F} \subset
[0,1]^m$, we define an intrinsic dimension of $\mathcal{F}$ that depends on the
covering number of the spherical code in the space and bound the regret in
terms of this intrinsic dimension. When $\mathcal{F}$ is the family of linear
contracts, we show that the Stackelberg regret grows exactly as
$\Theta(T^{2/3})$.
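For intuition on the $\Theta(T^{2/3})$ rate for linear contracts, a generic discretize-then-bandit baseline already attains it up to logarithmic factors: round the share parameter $\alpha \in [0,1]$ to a grid of roughly $T^{1/3}$ points and run UCB over the grid, balancing bandit regret $\widetilde O(\sqrt{T^{1/3} \cdot T})$ against per-round discretization error $O(T^{-1/3})$. The sketch below implements this baseline, not the paper's algorithm, and the environment (`R`, `c`, Bernoulli outcomes) is a hypothetical instance:

```python
import numpy as np

def ucb_linear_contracts(T, play):
    """UCB over linear contracts alpha in {0, 1/K, ..., 1} with K ~ T^(1/3)."""
    K = max(2, int(round(T ** (1 / 3))))
    grid = np.linspace(0.0, 1.0, K + 1)
    counts = np.zeros(K + 1)
    means = np.zeros(K + 1)
    for t in range(T):
        if t <= K:                               # pull every arm once first
            i = t
        else:                                    # then follow the UCB index
            i = int(np.argmax(means + np.sqrt(2 * np.log(T) / counts)))
        u = play(grid[i])                        # realized principal utility
        counts[i] += 1
        means[i] += (u - means[i]) / counts[i]   # running-mean update
    return grid[int(np.argmax(means))]           # empirically best share alpha

# Hypothetical environment: 3 actions with expected rewards R and costs c;
# under a linear contract the agent is paid an alpha share of the reward.
R = np.array([0.2, 0.6, 0.9])
c = np.array([0.0, 0.15, 0.4])
rng = np.random.default_rng(1)

def play(alpha):
    a = int(np.argmax(alpha * R - c))        # agent's best response
    reward = float(rng.random() < R[a])      # Bernoulli reward with mean R[a]
    return (1 - alpha) * reward              # principal keeps the rest

print(ucb_linear_contracts(10_000, play))
```

The discretization error stays $O(1/K)$ here because the agent's induced expected reward is monotone in $\alpha$, so rounding $\alpha$ upward to the grid costs at most $1/K$ per round when rewards lie in $[0,1]$; the paper's lower bound shows the resulting $T^{2/3}$ rate is tight.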
The contract design problem is challenging because the utility function is
discontinuous. Bounding the discretization error in this setting has been an
open problem. In this paper, we identify a limited set of directions in which
the utility function is continuous, allowing us to design a new discretization
method and bound its error. This approach enables the first upper bound with no
restrictions on the contract and action space.
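To see why bounding the discretization error matters, consider the generic decomposition (a sketch, not the paper's exact argument) for any method that runs a finite-armed bandit algorithm over a discretization $\mathcal{D} \subset [0,1]^m$ of the contract space:

```latex
\[
\mathrm{Reg}(T) \;\lesssim\;
\underbrace{\widetilde{O}\bigl(\sqrt{|\mathcal{D}|\, T}\bigr)}_{\text{bandit regret over } \mathcal{D}}
\;+\;
\underbrace{T \cdot \Bigl(\sup_{x \in [0,1]^m} u(x) \;-\; \max_{x' \in \mathcal{D}} u(x')\Bigr)}_{\text{discretization error}}
\]
```

where $u$ denotes the principal's expected per-round utility. Since $u$ is discontinuous, a uniform grid gives no a priori control over the second term; the directional continuity identified in the paper is what allows the error of its discretization to be bounded and the two terms to be balanced.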
Related papers
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\widetilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\widetilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning [43.75491612671571]
We consider multi-objective reinforcement learning where the objectives are balanced using preferences.
We formalize this problem as an episodic learning problem on a Markov decision process.
We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\} \cdot H^3 SA/\epsilon^2}\bigr)$.
arXiv Detail & Related papers (2020-11-25T21:45:04Z)
- Differentially Private Online Submodular Maximization [16.422215672356167]
We consider the problem of online submodular maximization under a cardinality constraint with differential privacy (DP).
A stream of $T$ submodular functions over a common finite ground set $U$ arrives online, and at each time step the decision maker must choose at most $k$ elements of $U$ before observing the function.
The decision maker obtains a payoff equal to the function evaluated on the chosen set, and aims to learn a sequence of sets that achieves low expected regret.
arXiv Detail & Related papers (2020-10-24T07:23:30Z)
- Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z)
- Logistic Regression Regret: What's the Catch? [3.7311680121118345]
We derive lower bounds with logarithmic regret under $L_\infty$ constraints on the parameters.
For $L_2$ constraints, it is shown that for large enough $d$, the regret remains linear in $d$ but no longer logarithmic in $T$.
arXiv Detail & Related papers (2020-02-07T18:36:39Z)