Adaptively Optimize Content Recommendation Using Multi Armed Bandit
Algorithms in E-commerce
- URL: http://arxiv.org/abs/2108.01440v1
- Date: Fri, 30 Jul 2021 21:03:38 GMT
- Title: Adaptively Optimize Content Recommendation Using Multi Armed Bandit
Algorithms in E-commerce
- Authors: Ding Xiang, Becky West, Jiaqi Wang, Xiquan Cui, Jinzhou Huang
- Abstract summary: We analyze using three classic MAB algorithms, epsilon-greedy, Thompson sampling (TS), and upper confidence bound 1 (UCB1) for dynamic content recommendations.
We compare the accumulative rewards of the three MAB algorithms with more than 1,000 trials using actual historical A/B test datasets.
We develop a batch-updated MAB algorithm to overcome the delayed reward issue in e-commerce.
- Score: 4.143179903857126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: E-commerce sites strive to provide users the most timely relevant information
in order to reduce shopping frictions and increase customer satisfaction. Multi
armed bandit models (MAB) as a type of adaptive optimization algorithms provide
possible approaches for such purposes. In this paper, we analyze using three
classic MAB algorithms, epsilon-greedy, Thompson sampling (TS), and upper
confidence bound 1 (UCB1) for dynamic content recommendations, and walk through
the process of developing these algorithms internally to solve a real world
e-commerce use case. First, we analyze the three MAB algorithms using simulated
purchasing datasets with non-stationary reward distributions to simulate the
possible time-varying customer preferences, where the traffic allocation
dynamics and the accumulative rewards of different algorithms are studied.
Second, we compare the accumulative rewards of the three MAB algorithms with
more than 1,000 trials using actual historical A/B test datasets. We find that
the larger difference between the success rates of competing recommendations
the more accumulative rewards the MAB algorithms can achieve. In addition, we
find that TS shows the highest average accumulative rewards under different
testing scenarios. Third, we develop a batch-updated MAB algorithm to overcome
the delayed reward issue in e-commerce and enable an online content
optimization on our App homepage. For a state-of-the-art comparison, a real A/B
test among our batch-updated MAB algorithm, a third-party MAB solution, and the
default business logic are conducted. The result shows that our batch-updated
MAB algorithm outperforms the counterparts and achieves 6.13% relative
click-through rate (CTR) increase and 16.1% relative conversion rate (CVR)
increase compared to the default experience, and 2.9% relative CTR increase and
1.4% relative CVR increase compared to the external MAB service.
Related papers
- LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments [2.2021543101231167]
This paper introduces the LLM-Assisted Online Learning Algorithm (LOLA) to optimize content delivery.
We first investigate three broad pure-LLM approaches: prompt-based methods, embedding-based classification models, and fine-tuned open-source LLMs.
We then introduce LOLA, which combines the best pure-LLM approach with the Upper Confidence Bound algorithm to adaptively allocate traffic and maximize clicks.
arXiv Detail & Related papers (2024-06-03T07:56:58Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning [50.92957910121088]
This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS)
For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium.
We extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.
arXiv Detail & Related papers (2024-04-30T06:48:56Z) - An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization.
In this paper, we propose a novel reinforcement learning-based B&B algorithm.
We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z) - Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative
Filtering [59.349057602266]
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering.
We propose to decompose the inner-product-based softmax probability based on the inverted multi-index.
FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency.
arXiv Detail & Related papers (2021-09-13T08:31:59Z) - Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and
Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z) - Learning from eXtreme Bandit Feedback [105.0383130431503]
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
arXiv Detail & Related papers (2020-09-27T20:47:25Z) - An Extensive Experimental Evaluation of Automated Machine Learning
Methods for Recommending Classification Algorithms (Extended Version) [4.400989370979334]
Three of these methods are based on Evolutionary Algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method.
We performed controlled experiments where these four AutoML methods were given the same runtime limit for different values of this limit.
In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant.
arXiv Detail & Related papers (2020-09-16T02:36:43Z) - ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.