Policy Optimization Using Semiparametric Models for Dynamic Pricing
- URL: http://arxiv.org/abs/2109.06368v1
- Date: Mon, 13 Sep 2021 23:50:01 GMT
- Title: Policy Optimization Using Semiparametric Models for Dynamic Pricing
- Authors: Jianqing Fan, Yongyi Guo, Mengxin Yu
- Abstract summary: We study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise.
We propose a dynamic statistical learning and decision-making policy that combines semiparametric estimation from a generalized linear model with an unknown link and online decision-making.
- Score: 1.3428344011390776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the contextual dynamic pricing problem where the
market value of a product is linear in its observed features plus some market
noise. Products are sold one at a time, and only a binary response indicating
success or failure of a sale is observed. Our model setting is similar to
Javanmard and Nazerzadeh [2019] except that we expand the demand curve to a
semiparametric model and need to learn dynamically both parametric and
nonparametric components. We propose a dynamic statistical learning and
decision-making policy that combines semiparametric estimation from a
generalized linear model with an unknown link and online decision-making to
minimize regret (maximize revenue). Under mild conditions, we show that for a
market noise c.d.f. $F(\cdot)$ with $m$-th order derivative ($m\geq 2$), our
policy achieves a regret upper bound of $\tilde{O}_{d}(T^{\frac{2m+1}{4m-1}})$,
where $T$ is the time horizon and $\tilde{O}_{d}$ hides logarithmic terms and
the feature dimensionality $d$. The upper bound is further reduced to
$\tilde{O}_{d}(\sqrt{T})$ if $F$ is super smooth, i.e., its Fourier transform
decays exponentially. In terms of dependence on the horizon $T$, these upper
bounds are close to $\Omega(\sqrt{T})$, the lower bound that holds when $F$
belongs to a parametric class. We further generalize these results to the
case with dynamically dependent product features under the strong mixing
condition.
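To make the model concrete, below is a minimal simulation sketch of the setting described in the abstract. It is not the authors' policy: the market-noise c.d.f. $F$ is fixed to a logistic link (whereas the paper estimates the unknown link nonparametrically), and a simple explore-then-commit scheme stands in for the paper's learning-while-pricing algorithm. All names here (`theta_hat`, `T_explore`, the price grid) are illustrative assumptions.

```python
# Sketch of the semiparametric pricing model: market value v_t = <theta, x_t> + z_t
# with noise z_t ~ F, posted price p_t, and binary feedback y_t = 1{v_t >= p_t},
# so the demand curve is P(sale | x, p) = 1 - F(p - <theta, x>).
import numpy as np

rng = np.random.default_rng(0)
d, T, T_explore = 5, 10_000, 1_000
theta = rng.normal(size=d) / np.sqrt(d)      # unknown market-value parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sale(x, p):
    """Binary response: did the buyer's valuation exceed the posted price?"""
    z = rng.logistic(scale=0.5)              # market noise z ~ F (assumed logistic)
    return float(theta @ x + z >= p)

# Phase 1: random exploratory prices to collect (x_t, p_t, y_t) triples.
X = rng.normal(size=(T_explore, d))
P = rng.uniform(0.0, 3.0, size=T_explore)
Y = np.array([sale(x, p) for x, p in zip(X, P)])

# Fit P(sale) = sigmoid(a * (theta^T x - p)) by logistic-regression MLE,
# parameterized as w = (a*theta, a), via plain gradient ascent.
w = np.zeros(d + 1)
for _ in range(2000):
    s = sigmoid(X @ w[:d] - w[d] * P)
    resid = Y - s                            # residual driving the log-likelihood gradient
    grad = np.concatenate([X.T @ resid, [-(P @ resid)]]) / T_explore
    w += 0.5 * grad
theta_hat = w[:d] / w[d]                     # recover theta up to the scale a

# Phase 2: exploit -- post the greedy revenue-maximizing price on a grid.
grid = np.linspace(0.0, 3.0, 301)
revenue = 0.0
for _ in range(T - T_explore):
    x = rng.normal(size=d)
    exp_rev = grid * (1.0 - sigmoid(w[d] * (grid - theta_hat @ x)))
    p_star = grid[np.argmax(exp_rev)]        # argmax of p * (1 - F_hat(p - theta_hat^T x))
    revenue += p_star * sale(x, p_star)

print(f"||theta_hat - theta|| = {np.linalg.norm(theta_hat - theta):.3f}")
print(f"exploitation revenue over {T - T_explore} rounds: {revenue:.1f}")
```

Under this logistic assumption, the greedy price maximizes $p\,(1 - F(p - \hat{\theta}^{\top}x))$ on a grid; the paper's semiparametric policy instead re-estimates both $\theta$ and the unknown link $F$ as sales data accrue, balancing exploration against exploitation to achieve the stated $\tilde{O}_{d}(T^{\frac{2m+1}{4m-1}})$ regret.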
Related papers
- Provable In-context Learning for Mixture of Linear Regressions using Transformers [34.458004744956334]
We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models.
For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order $\tilde{\mathcal{O}}((d/n)^{1/4})$ in the low signal-to-noise ratio (SNR) regime and $\tilde{\mathcal{O}}(\sqrt{d/n})$ in the high SNR regime.
arXiv Detail & Related papers (2024-10-18T05:28:47Z)
- Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Model approximation in MDPs with unbounded per-step cost [3.456139143869137]
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$.
How well does an optimal policy $\hat{\pi}^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$?
We provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models.
arXiv Detail & Related papers (2024-02-13T21:36:30Z)
- On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm [59.65871549878937]
This paper considers RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\left[\|\nabla f(x^k)\|_1\right] \leq O(\frac{\sqrt{d}}{T^{1/4}})$.
Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$.
Our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^{T} \mathbb{E}\left[\|\nabla f(x^k)\|_2\right] \leq O(\frac{1}{T^{1/4}})$ rate of SGD measured by the $\ell_2$ norm.
arXiv Detail & Related papers (2024-02-01T07:21:32Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Dynamic Pricing and Learning under the Bass Model [16.823029377470366]
We develop an algorithm that satisfies a high-probability regret guarantee of order $\tilde{O}(m^{2/3})$, where the market size $m$ is known a priori.
Unlike most regret analysis results, in the present problem the market size $m$ is the fundamental driver of the complexity.
arXiv Detail & Related papers (2021-03-09T03:27:33Z)
- Logarithmic Regret in Feature-based Dynamic Pricing [0.0]
Feature-based dynamic pricing is an increasingly popular model of setting prices for highly differentiated products.
We provide two algorithms for the stochastic and adversarial feature settings, respectively, and prove the optimal $O(d\log T)$ regret bounds for both.
We also prove an $\Omega(\sqrt{T})$ information-theoretic lower bound for a slightly more general setting, which demonstrates that "knowing-the-demand-curve" leads to an exponential improvement in feature-based dynamic pricing.
arXiv Detail & Related papers (2021-02-20T00:45:33Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Model-Based Reinforcement Learning with Value-Targeted Regression [48.92439657407732]
We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$.
We derive a regret bound which, in the special case of linear mixture models, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^3T})$.
arXiv Detail & Related papers (2020-06-01T17:47:53Z)
- Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured in prior work.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)