Adaptively Optimize Content Recommendation Using Multi Armed Bandit
Algorithms in E-commerce
- URL: http://arxiv.org/abs/2108.01440v1
- Date: Fri, 30 Jul 2021 21:03:38 GMT
- Title: Adaptively Optimize Content Recommendation Using Multi Armed Bandit
Algorithms in E-commerce
- Authors: Ding Xiang, Becky West, Jiaqi Wang, Xiquan Cui, Jinzhou Huang
- Abstract summary: We analyze three classic MAB algorithms, epsilon-greedy, Thompson sampling (TS), and upper confidence bound 1 (UCB1), for dynamic content recommendation.
We compare the accumulative rewards of the three MAB algorithms with more than 1,000 trials using actual historical A/B test datasets.
We develop a batch-updated MAB algorithm to overcome the delayed reward issue in e-commerce.
- Score: 4.143179903857126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: E-commerce sites strive to provide users with the most timely and relevant information
in order to reduce shopping friction and increase customer satisfaction. Multi-armed
bandit (MAB) models, a class of adaptive optimization algorithms, offer possible
approaches for this purpose. In this paper, we analyze three classic MAB algorithms,
epsilon-greedy, Thompson sampling (TS), and upper confidence bound 1 (UCB1), for
dynamic content recommendation, and walk through the process of developing these
algorithms internally to solve a real-world e-commerce use case. First, we analyze
the three MAB algorithms on simulated purchasing datasets with non-stationary reward
distributions that mimic time-varying customer preferences, and study the traffic
allocation dynamics and the accumulative rewards of the different algorithms.
Second, we compare the accumulative rewards of the three MAB algorithms with
more than 1,000 trials using actual historical A/B test datasets. We find that
the larger the difference between the success rates of competing recommendations,
the more accumulative rewards the MAB algorithms can achieve. In addition, we
find that TS shows the highest average accumulative rewards under different
testing scenarios. Third, we develop a batch-updated MAB algorithm to overcome
the delayed reward issue in e-commerce and enable an online content
optimization on our App homepage. For a state-of-the-art comparison, a real A/B
test among our batch-updated MAB algorithm, a third-party MAB solution, and the
default business logic is conducted. The results show that our batch-updated
MAB algorithm outperforms both counterparts, achieving a 6.13% relative
click-through rate (CTR) increase and a 16.1% relative conversion rate (CVR)
increase over the default experience, and a 2.9% relative CTR increase and a
1.4% relative CVR increase over the external MAB service.
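
As a concrete illustration of the first experiment, the sketch below (not the authors' implementation) runs epsilon-greedy, Thompson sampling, and UCB1 on simulated Bernoulli purchase rewards whose success rates drift over time; the epsilon value, drift schedule, arm rates, and horizon are illustrative assumptions.

```python
# Minimal sketch of the three classic MAB policies compared in the paper, run on
# simulated Bernoulli "purchase" rewards whose success probabilities change over
# time (a simple stand-in for time-varying customer preferences). Epsilon, the
# drift schedule, and the horizon are illustrative assumptions.
import math
import random


class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore uniformly
        return max(range(len(self.counts)), key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class ThompsonSampling:
    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # Beta posterior: successes + 1
        self.beta = [1.0] * n_arms   # Beta posterior: failures + 1

    def select(self):
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward


class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm  # play every arm once before using the bound
        total = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def true_rates(t, horizon):
    # Non-stationary success rates: the better arm switches halfway through.
    return [0.05, 0.08] if t < horizon // 2 else [0.08, 0.05]


def run(policy, horizon=10_000, seed=0):
    random.seed(seed)
    total = 0
    for t in range(horizon):
        arm = policy.select()
        reward = 1 if random.random() < true_rates(t, horizon)[arm] else 0
        policy.update(arm, reward)
        total += reward
    return total  # accumulative reward over the horizon


if __name__ == "__main__":
    for name, policy in [("epsilon-greedy", EpsilonGreedy(2)),
                         ("Thompson sampling", ThompsonSampling(2)),
                         ("UCB1", UCB1(2))]:
        print(name, run(policy))
```

With this setup, the behavior of the three policies under the mid-horizon preference switch can be compared by their accumulative rewards, loosely mirroring the traffic-allocation and reward analysis described in the abstract.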
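The batch-updated variant addresses the delayed-reward issue: clicks and conversions arrive too late to update the policy after every impression, so traffic within a batch is allocated from a frozen posterior and the posterior is refreshed only once the batch's outcomes come back. The following is a hedged sketch of that idea using Thompson sampling with Beta posteriors, not the paper's production algorithm; the batch size and simulated success rates are illustrative assumptions.

```python
# Hedged sketch of batch-updated Thompson sampling for delayed rewards: within
# each batch the posterior is frozen, impressions are allocated by sampling from
# it, and the Beta parameters are updated only after the batch's (delayed)
# outcomes are observed. Batch size and success rates are illustrative.
import random


def batch_updated_thompson(true_rates, n_batches=50, batch_size=1_000, seed=0):
    random.seed(seed)
    n_arms = len(true_rates)
    alpha = [1.0] * n_arms  # Beta posterior: successes + 1
    beta = [1.0] * n_arms   # Beta posterior: failures + 1
    total_reward = 0

    for _ in range(n_batches):
        # Allocate the whole batch using the posterior as it stood at batch start.
        successes = [0] * n_arms
        pulls = [0] * n_arms
        for _ in range(batch_size):
            samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
            arm = max(range(n_arms), key=lambda i: samples[i])
            pulls[arm] += 1
            # The outcome is only observed later; accumulate it for the batch update.
            successes[arm] += 1 if random.random() < true_rates[arm] else 0

        # Delayed rewards arrive: update the posterior once per batch.
        for arm in range(n_arms):
            alpha[arm] += successes[arm]
            beta[arm] += pulls[arm] - successes[arm]
        total_reward += sum(successes)

    return total_reward, alpha, beta


if __name__ == "__main__":
    reward, alpha, beta = batch_updated_thompson([0.03, 0.04, 0.05])
    print("accumulative reward:", reward)
    print("posterior means:", [a / (a + b) for a, b in zip(alpha, beta)])
```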
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- Improving Portfolio Optimization Results with Bandit Networks [0.0]
We introduce and evaluate novel Bandit algorithms designed for non-stationary environments.
First, we present the Adaptive Discounted Thompson Sampling (ADTS) algorithm.
We then extend this approach to the Portfolio Optimization problem by introducing the Combinatorial Adaptive Discounted Thompson Sampling (CADTS) algorithm.
arXiv Detail & Related papers (2024-10-05T16:17:31Z)
- Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date.
We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models.
Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z)
- LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments [2.2021543101231167]
This paper introduces the LLM-Assisted Online Learning Algorithm (LOLA).
LOLA integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery.
Our numerical experiments on Upworthy data show LOLA outperforms the standard A/B test method.
arXiv Detail & Related papers (2024-06-03T07:56:58Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning [50.92957910121088]
This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS).
For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium.
We extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.
arXiv Detail & Related papers (2024-04-30T06:48:56Z)
- An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization.
In this paper, we propose a novel reinforcement learning-based B&B algorithm.
We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z)
- Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative Filtering [59.349057602266]
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering.
We propose to decompose the inner-product-based softmax probability based on the inverted multi-index.
FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency.
arXiv Detail & Related papers (2021-09-13T08:31:59Z)
- ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.