Contextual Online Pricing with (Biased) Offline Data
- URL: http://arxiv.org/abs/2507.02762v1
- Date: Thu, 03 Jul 2025 16:21:49 GMT
- Title: Contextual Online Pricing with (Biased) Offline Data
- Authors: Yixuan Zhang, Ruihao Zhu, Qiaomin Xie,
- Abstract summary: We study contextual online pricing with biased offline data.<n>An Optimism-in-the-Face-of-Uncertainty (OFU) policy achieves a minimax-optimal, instance-dependent regret bound $tildemathcalObig.
- Score: 22.723289890639723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study contextual online pricing with biased offline data. For the scalar price elasticity case, we identify the instance-dependent quantity $\delta^2$ that measures how far the offline data lies from the (unknown) online optimum. We show that the time length $T$, bias bound $V$, size $N$ and dispersion $\lambda_{\min}(\hat{\Sigma})$ of the offline data, and $\delta^2$ jointly determine the statistical complexity. An Optimism-in-the-Face-of-Uncertainty (OFU) policy achieves a minimax-optimal, instance-dependent regret bound $\tilde{\mathcal{O}}\big(d\sqrt{T} \wedge (V^2T + \frac{dT}{\lambda_{\min}(\hat{\Sigma}) + (N \wedge T) \delta^2})\big)$. For general price elasticity, we establish a worst-case, minimax-optimal rate $\tilde{\mathcal{O}}\big(d\sqrt{T} \wedge (V^2T + \frac{dT }{\lambda_{\min}(\hat{\Sigma})})\big)$ and provide a generalized OFU algorithm that attains it. When the bias bound $V$ is unknown, we design a robust variant that always guarantees sub-linear regret and strictly improves on purely online methods whenever the exact bias is small. These results deliver the first tight regret guarantees for contextual pricing in the presence of biased offline data. Our techniques also transfer verbatim to stochastic linear bandits with biased offline data, yielding analogous bounds.
Related papers
- Regret minimization in Linear Bandits with offline data via extended D-optimal exploration [25.533340168328564]
We consider the problem of online regret in linear bandits with access to prior observations (offline data) from the underlying bandit model.<n>Our algorithm, Offline-Online Phased Elimination (OOPE), effectively incorporates the offline data to substantially reduce the online regret compared to prior work.
arXiv Detail & Related papers (2025-08-11T19:14:56Z) - Actor-Critics Can Achieve Optimal Sample Efficiency [15.033410073144939]
We introduce a novel actor-critic algorithm that attains a sample-complexity of $O(dH5 log|mathcalA|/epsilon2 + d H4 log|mathcalF|/ epsilon2)$ trajectories.<n>We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.
arXiv Detail & Related papers (2025-05-06T17:32:39Z) - Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update [62.96781471194877]
Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits.<n>We propose a emphone-pass algorithm based on the online mirror descent framework.
arXiv Detail & Related papers (2025-03-01T09:41:45Z) - Learning to Price Homogeneous Data [6.288169915425957]
We develop novel discretization schemes to approximate any pricing curve.
Our online algorithms build on classical algorithms such as UCB and FTPL.
Using the improved discretization schemes, we are able to achieve $tildeO(msqrtT)$ regret in the setting and $tildeO(m3/2sqrtT)$ regret in the adversarial setting.
arXiv Detail & Related papers (2024-07-07T20:02:52Z) - Leveraging Offline Data in Linear Latent Contextual Bandits [27.272915631913175]
We design end-to-end latent bandit algorithms capable of handing uncountably many latent states.<n>We present two online algorithms that utilize the output of this offline algorithm to accelerate online learning.
arXiv Detail & Related papers (2024-05-27T16:23:34Z) - On the Interplay Between Misspecification and Sub-optimality Gap in
Linear Contextual Bandits [76.2262680277608]
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class.
We show that our algorithm enjoys the same gap-dependent regret bound $tilde O (d2/Delta)$ as in the well-specified setting up to logarithmic factors.
arXiv Detail & Related papers (2023-03-16T15:24:29Z) - On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $tildemathcalO(frac1K)$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $tildemathcalO(frac1K)$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least square estimator for the unknown transitional dynamic.
This also improves upon the best-known algorithms in this setting when $sigma_k2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Towards Instance-Optimal Offline Reinforcement Learning with Pessimism [34.54294677335518]
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP)
In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches [ Oleft(sum_h=1Hsum_s_h,a_hdpistar_h(s_h,a_h)sqrtfracmathrmmathrmVar_
arXiv Detail & Related papers (2021-10-17T01:21:52Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Smoothed Analysis with Adaptive Adversaries [28.940767013349408]
We prove novel algorithmic guarantees for several online problems in the smoothed analysis model.
In this model, at each time an adversary chooses an input distribution with density function bounded above by $tfrac1sigma$ times that of the uniform distribution.
Our results hold for em adaptive adversaries that can choose an input distribution based on the decisions of the algorithm and the realizations of the inputs in the previous time steps.
arXiv Detail & Related papers (2021-02-16T20:54:49Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $min(sqrtdT, T3/4)$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Online Robust Regression via SGD on the l1 loss [19.087335681007477]
We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner.
We show in this work that the descent on the $ell_O( 1 / (1 - eta)2 n )$ loss converges to the true parameter vector at a $tildeO( 1 / (1 - eta)2 n )$ rate which is independent of the values of the contaminated measurements.
arXiv Detail & Related papers (2020-07-01T11:38:21Z) - On Frequentist Regret of Linear Thompson Sampling [8.071506311915398]
This paper studies the linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $mathbbRd$ and receives noisy rewards.
The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions.
arXiv Detail & Related papers (2020-06-11T20:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.