Generator and Critic: A Deep Reinforcement Learning Approach for Slate
Re-ranking in E-commerce
- URL: http://arxiv.org/abs/2005.12206v1
- Date: Mon, 25 May 2020 16:24:01 GMT
- Title: Generator and Critic: A Deep Reinforcement Learning Approach for Slate
Re-ranking in E-commerce
- Authors: Jianxiong Wei, Anxiang Zeng, Yueqiu Wu, Peng Guo, Qingsong Hua,
Qingpeng Cai
- Abstract summary: We present a novel Generator and Critic slate re-ranking approach, where the Critic evaluates the slate and the Generator ranks the items by the reinforcement learning approach.
For the Generator, to tackle the problem of large action space, we propose a new exploration reinforcement learning algorithm, called PPO-Exploration.
- Score: 17.712394984304336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The slate re-ranking problem considers the mutual influences between items to
improve user satisfaction in e-commerce, compared with the point-wise ranking.
Previous works either directly rank items by an end-to-end model, or rank items
by a score function that trades off the point-wise score and the diversity
between items. However, there are two main existing challenges that are not
well studied: (1) the evaluation of the slate is hard due to the complex mutual
influences between items of one slate; (2) even given the optimal evaluation,
searching the optimal slate is challenging as the action space is exponentially
large. In this paper, we present a novel Generator and Critic slate re-ranking
approach, where the Critic evaluates the slate and the Generator ranks the
items by the reinforcement learning approach. We propose a Full Slate Critic
(FSC) model that considers the real impressed items and avoids the impressed
bias of existing models. For the Generator, to tackle the problem of large
action space, we propose a new exploration reinforcement learning algorithm,
called PPO-Exploration. Experimental results show that the FSC model
significantly outperforms the state-of-the-art slate evaluation methods, and
the PPO-Exploration algorithm outperforms the existing reinforcement learning
methods substantially. The Generator and Critic approach improves both the
slate efficiency (4% GMV and 5% number of orders) and diversity in live
experiments on one of the largest e-commerce websites in the world.
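The abstract's two-part design (a Critic that scores a whole slate, a Generator trained by RL against that score) can be illustrated with a toy sketch. This is not the paper's FSC model or PPO-Exploration algorithm; it uses a hand-crafted quality-plus-diversity reward and plain REINFORCE with a moving-average baseline as hypothetical stand-ins, purely to show the generator/critic interaction.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ITEMS, SLATE_K, DIM = 20, 4, 8
item_feats = rng.normal(size=(N_ITEMS, DIM))

def critic_score(slate):
    """Toy slate critic: per-item quality plus pairwise diversity.
    (Hypothetical stand-in for the learned Full Slate Critic.)"""
    feats = item_feats[slate]
    quality = feats[:, 0].sum()
    diffs = feats[:, None, :] - feats[None, :, :]
    diversity = np.sqrt((diffs ** 2).sum(-1)).mean()
    return quality + diversity

def generate_slate(logits, temperature=1.0):
    """Generator: sample K distinct items sequentially from a softmax policy."""
    slate, mask = [], np.zeros(N_ITEMS, dtype=bool)
    for _ in range(SLATE_K):
        z = logits / temperature
        z[mask] = -np.inf                     # no repeated items in one slate
        p = np.exp(z - z[~mask].max())
        p /= p.sum()
        i = int(rng.choice(N_ITEMS, p=p))
        slate.append(i)
        mask[i] = True
    return slate

# REINFORCE-style loop: the generator is trained on the critic's reward
logits = np.zeros(N_ITEMS)
baseline = 0.0
for step in range(300):
    slate = generate_slate(logits)
    r = critic_score(slate)
    baseline = 0.9 * baseline + 0.1 * r       # moving-average baseline
    adv = r - baseline
    grad = np.zeros(N_ITEMS)
    grad[slate] += 1.0                        # crude score-function gradient
    logits += 0.05 * adv * (grad - SLATE_K / N_ITEMS)

best = generate_slate(logits, temperature=0.1)
```

The sequential sampling with a mask is one simple way to cope with the exponentially large slate action space the abstract mentions: the policy picks one item at a time instead of scoring every permutation.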
Related papers
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [111.46455901113976]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z) - Investigating the Robustness of Sequential Recommender Systems Against
Training Data Perturbations [9.463133630647569]
We introduce Finite Rank-Biased Overlap (FRBO), an enhanced similarity measure tailored explicitly for finite rankings.
We empirically investigate the impact of removing items at different positions within a temporally ordered sequence.
Our results demonstrate that removing items at the end of the sequence has a statistically significant impact on performance.
arXiv Detail & Related papers (2023-07-24T23:26:46Z) - SimOAP: Improve Coherence and Consistency in Persona-based Dialogue
Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence are great challenges for language models.
A two-stage SimOAP strategy is proposed, i.e., over-sampling and post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z) - Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
In this study, three self-supervision courses are designed to alleviate inherent flaws of "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on GLUE and SQuAD 2.0 benchmarks.
arXiv Detail & Related papers (2023-05-06T09:02:10Z) - PIER: Permutation-Level Interest-Based End-to-End Re-ranking Framework
in E-commerce [13.885695433738437]
Existing re-ranking methods directly take the initial ranking list as input, and generate the optimal permutation through a well-designed context-wise model.
However, evaluating all candidate permutations brings unacceptable computational costs in practice.
This paper presents a novel end-to-end re-ranking framework named PIER to tackle the above challenges.
arXiv Detail & Related papers (2023-02-06T09:17:52Z) - Multi-Objective Personalized Product Retrieval in Taobao Search [27.994166796745496]
We propose a novel Multi-Objective Personalized Product Retrieval (MOPPR) model with four hierarchical optimization objectives: relevance, exposure, click and purchase.
MOPPR achieves 0.96% transaction and 1.29% GMV improvements in a 28-day online A/B test.
Since the Double-11 shopping festival of 2021, MOPPR has been fully deployed in mobile Taobao search, replacing the previous MGDSPR.
arXiv Detail & Related papers (2022-10-09T05:18:42Z) - WSLRec: Weakly Supervised Learning for Neural Sequential Recommendation
Models [24.455665093145818]
We propose a novel model-agnostic training approach called WSLRec, which adopts a three-stage framework: pre-training, top-$k$ mining, intrinsic and fine-tuning.
WSLRec resolves the incompleteness problem by pre-training models on extra weak supervisions from model-free methods like BR and ItemCF, while resolving the inaccuracy problem by leveraging the top-$k$ mining to screen out reliable user-item relevance from weak supervisions for fine-tuning.
arXiv Detail & Related papers (2022-02-28T08:55:12Z) - Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z) - Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
arXiv Detail & Related papers (2020-06-10T11:18:57Z) - Learning Robust Models for e-Commerce Product Search [23.537201383165755]
Showing items that do not match search query intent degrades customer experience in e-commerce.
Mitigating the problem requires a large labeled dataset.
We develop a deep, end-to-end model that learns to effectively classify mismatches.
arXiv Detail & Related papers (2020-05-07T17:22:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.