AliExpress Learning-To-Rank: Maximizing Online Model Performance without
Going Online
- URL: http://arxiv.org/abs/2003.11941v5
- Date: Thu, 31 Dec 2020 10:04:48 GMT
- Title: AliExpress Learning-To-Rank: Maximizing Online Model Performance without
Going Online
- Authors: Guangda Huzhang, Zhen-Jia Pang, Yongqing Gao, Yawen Liu, Weijie Shen,
Wen-Ji Zhou, Qing Da, An-Xiang Zeng, Han Yu, Yang Yu, and Zhi-Hua Zhou
- Abstract summary: This paper proposes an evaluator-generator framework for learning-to-rank.
It consists of an evaluator that generalizes to evaluate recommendations involving item context, and a generator that maximizes the evaluator score by reinforcement learning.
Our method achieves a significant improvement in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.
- Score: 60.887637616379926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning-to-rank (LTR) has become a key technology in E-commerce
applications. Most existing LTR approaches follow a supervised learning
paradigm using offline labeled data collected from the online system. However,
it has been noticed that LTR models can achieve good performance on offline
validation data yet perform poorly online, and vice versa, which implies a
potentially large inconsistency between offline and online evaluation. We
investigate and confirm in this paper that such inconsistency exists and can
have a significant impact on AliExpress Search. Reasons for the inconsistency
include ignoring item context during learning and the insufficiency of the
offline dataset for learning that context. Therefore, this paper proposes an
evaluator-generator framework for LTR with item context. The framework
consists of an evaluator that generalizes to evaluate recommendations
involving item context, a generator that maximizes the evaluator score by
reinforcement learning, and a discriminator that ensures the generalization of
the evaluator. Extensive experiments in simulation environments and on the
AliExpress Search online system show that, first, classic data-based metrics
on the offline dataset can be significantly inconsistent with online
performance and can even be misleading. Second, the proposed evaluator score
is significantly more consistent with online performance than common ranking
metrics. Finally, as a consequence, our method achieves a significant
improvement (greater than 2%) in Conversion Rate (CR) over the
industrial-level fine-tuned model in online A/B tests.
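To make the framework concrete, below is a minimal, hypothetical sketch of the evaluator-generator loop: a page-level evaluator that scores each item in the context of the rest of the page, and a pointer-style generator trained with REINFORCE to maximize that score. It is an illustration under assumptions (feature dimensions, network shapes, PyTorch, and the REINFORCE update are all choices made here), not the authors' implementation; in particular the evaluator is assumed to be pre-trained on logged pages and is left untrained in the sketch.

```python
# Hypothetical sketch of the evaluator-generator idea (illustrative, not the paper's code).
# Assumptions: items are fixed-size feature vectors, the evaluator scores a whole page so
# that every item is judged in the context of the others, and the generator re-ranks a
# candidate set with REINFORCE to maximize the evaluator's predicted reward.
import torch
import torch.nn as nn

ITEM_DIM, HIDDEN, PAGE_LEN, N_CANDIDATES = 8, 32, 4, 10

class Evaluator(nn.Module):
    """Page-level scorer; in the paper's setting it would be trained on logged pages."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(ITEM_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, page):                      # page: (batch, PAGE_LEN, ITEM_DIM)
        _, h = self.encoder(page)                 # context-aware encoding of the page
        return self.head(h[-1]).squeeze(-1)       # predicted page reward (e.g. a CR proxy)

class Generator(nn.Module):
    """Sequentially picks distinct items from the candidate set (a minimal ranking policy)."""
    def __init__(self):
        super().__init__()
        self.scorer = nn.Linear(ITEM_DIM, 1)

    def forward(self, candidates):                # candidates: (N_CANDIDATES, ITEM_DIM)
        chosen, log_probs = [], []
        mask = torch.zeros(N_CANDIDATES, dtype=torch.bool)
        for _ in range(PAGE_LEN):
            logits = self.scorer(candidates).squeeze(-1).masked_fill(mask, -1e9)
            dist = torch.distributions.Categorical(logits=logits)
            idx = dist.sample()
            log_probs.append(dist.log_prob(idx))
            chosen.append(candidates[idx])
            mask[idx] = True
        return torch.stack(chosen), torch.stack(log_probs)

evaluator, generator = Evaluator(), Generator()   # evaluator assumed pre-trained and frozen
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(100):
    candidates = torch.randn(N_CANDIDATES, ITEM_DIM)           # stand-in for one query's items
    page, log_probs = generator(candidates)
    reward = evaluator(page.unsqueeze(0)).detach().squeeze()   # evaluator score as the reward
    loss = -(log_probs.sum() * reward)                          # REINFORCE: maximize the score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point the abstract emphasizes is that the generator's reward comes from the learned evaluator rather than from per-item offline labels, which is what allows ranking policies to be optimized and compared without going online.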
Related papers
- The Effects of Data Split Strategies on the Offline Experiments for CTR Prediction [0.0]
This study aims to address the inconsistency between current offline evaluation methods and real-world use cases.
We conduct extensive experiments using both random and temporal splits on a large open benchmark dataset, Criteo.
arXiv Detail & Related papers (2024-06-26T13:01:52Z)
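As a quick illustration of the two protocols that study compares, here is a hypothetical sketch of a random split versus a temporal split (the column names and the pandas usage are assumptions, not the authors' code):

```python
# Hypothetical sketch: random vs. temporal train/validation splits for CTR-style logs.
import pandas as pd

def random_split(df: pd.DataFrame, valid_frac: float = 0.1, seed: int = 0):
    """Shuffle all impressions, ignoring time; future events can leak into training."""
    valid = df.sample(frac=valid_frac, random_state=seed)
    train = df.drop(valid.index)
    return train, valid

def temporal_split(df: pd.DataFrame, valid_frac: float = 0.1, ts_col: str = "timestamp"):
    """Train on the past, validate on the most recent events, as in production serving."""
    df = df.sort_values(ts_col)
    cutoff = int(len(df) * (1 - valid_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]
```

The temporal split mirrors how the model is actually used online (trained on the past, evaluated on future traffic), which is why the two protocols can disagree on the same data.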
- Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z)
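A rough sketch of the warm-start idea in the bandit paper above, under strong simplifications: Bernoulli arms, Beta posteriors, and a single scalar "competence" weight that discounts the offline counts. This is an illustrative stand-in, not the paper's algorithm:

```python
# Hypothetical sketch: Thompson sampling for Bernoulli arms, warm-started from offline
# feedback. The "competence" weight that discounts the offline counts is a loose stand-in
# for modeling how reliable the logged expert feedback is.
import numpy as np

def warm_start_posteriors(offline_wins, offline_losses, competence=0.7):
    """Turn offline (noisy) preference counts into Beta priors, discounted by competence."""
    alpha = 1.0 + competence * np.asarray(offline_wins, dtype=float)
    beta = 1.0 + competence * np.asarray(offline_losses, dtype=float)
    return alpha, beta

def thompson_step(alpha, beta, pull_arm):
    """Sample one posterior draw per arm, play the best arm, update its posterior."""
    draws = np.random.beta(alpha, beta)
    arm = int(np.argmax(draws))
    reward = pull_arm(arm)          # 1 or 0 from the live environment
    alpha[arm] += reward
    beta[arm] += 1 - reward
    return arm, reward
```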
- ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles [52.34951901588738]
We propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL.
By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance.
Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods.
arXiv Detail & Related papers (2023-06-12T05:10:10Z)
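To illustrate the Q-ensemble ingredient mentioned in the ENOTO summary above, here is a hypothetical sketch in which the Bellman target is the minimum over a random subset of ensemble members: shrinking the subset gives the pessimism useful for offline pre-training, and enlarging it relaxes pessimism for online fine-tuning. The exact target rule here is an assumption, not necessarily ENOTO's update:

```python
# Hypothetical sketch: an ensemble of Q-networks whose target is the minimum over a
# random subset of members; the subset size controls how pessimistic the target is.
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    def __init__(self, obs_dim, act_dim, n_members=10, hidden=64):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_members)
        ])

    def forward(self, obs, act):                   # -> (n_members, batch, 1)
        x = torch.cat([obs, act], dim=-1)
        return torch.stack([q(x) for q in self.members])

    def target(self, obs, act, subset_size):
        """Pessimistic target: min over a random subset of ensemble members."""
        q_all = self.forward(obs, act)             # (n_members, batch, 1)
        idx = torch.randperm(len(self.members))[:subset_size]
        return q_all[idx].min(dim=0).values        # (batch, 1)
```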
- Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a 2.5x improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Imitate TheWorld: A Search Engine Simulation Platform [13.011052642314421]
We build a simulated search engine, AESim, in which a well-trained discriminator gives proper feedback on generated pages.
Unlike previous simulation platforms, which lose their connection with the real world, ours depends on real data from Search.
Our experiments also show AESim can better reflect the online performance of ranking models than classic ranking metrics.
arXiv Detail & Related papers (2021-07-16T03:55:33Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning usually assumes that the incoming data are fully labeled, which may not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Do Offline Metrics Predict Online Performance in Recommender Systems? [79.48653445643865]
We investigate the extent to which offline metrics predict online performance by evaluating recommenders across six simulated environments.
We observe that offline metrics are correlated with online performance over a range of environments.
We study the impact of adding exploration strategies, and observe that their effectiveness, when compared to greedy recommendation, is highly dependent on the recommendation algorithm.
arXiv Detail & Related papers (2020-11-07T01:41:13Z)
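In the spirit of the study above, a tiny hypothetical sketch of the consistency check itself: given one offline score and one (simulated) online score per candidate model, a rank correlation summarizes how well the offline metric predicts the online ordering (the numbers below are placeholders):

```python
# Hypothetical sketch: measuring how well an offline metric tracks (simulated) online
# performance across several ranking models; scipy's Spearman rank correlation is the
# assumed consistency measure, and the scores are made-up placeholders.
from scipy.stats import spearmanr

offline_ndcg = [0.412, 0.398, 0.440, 0.425, 0.391]   # one offline score per model
online_ctr   = [0.031, 0.029, 0.030, 0.034, 0.027]   # one simulated online score per model

rho, p_value = spearmanr(offline_ndcg, online_ctr)
print(f"rank correlation between offline and online: {rho:.2f} (p={p_value:.2f})")
```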
- Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20%.
arXiv Detail & Related papers (2020-09-19T19:36:43Z)
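For reference, a small hypothetical sketch of recall@k under a temporal protocol like the one that paper argues for, where each user's most recent interaction is held out and must be recovered in the top-k list (the data layout and names are illustrative assumptions):

```python
# Hypothetical sketch: recall@k with each user's most recent interaction held out.
from typing import Dict, List

def recall_at_k(recommendations: Dict[str, List[str]],
                held_out_item: Dict[str, str],
                k: int = 20) -> float:
    """Fraction of users whose held-out (most recent) item appears in their top-k list."""
    hits = sum(1 for user, target in held_out_item.items()
               if target in recommendations.get(user, [])[:k])
    return hits / max(len(held_out_item), 1)

# Example: user "u1" last interacted with item "i9", which appears in its top-20 list.
recs = {"u1": [f"i{j}" for j in range(50)], "u2": [f"i{j}" for j in range(100, 150)]}
print(recall_at_k(recs, {"u1": "i9", "u2": "i3"}, k=20))   # -> 0.5
```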
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)