RecFlow: An Industrial Full Flow Recommendation Dataset
- URL: http://arxiv.org/abs/2410.20868v1
- Date: Mon, 28 Oct 2024 09:36:03 GMT
- Title: RecFlow: An Industrial Full Flow Recommendation Dataset
- Authors: Qi Liu, Kai Zheng, Rui Huang, Wuchao Li, Kuo Cai, Yuan Chai, Yanan Niu, Yiqun Hui, Bing Han, Na Mou, Hongning Wang, Wentian Bao, Yunen Yu, Guorui Zhou, Han Li, Yang Song, Defu Lian, Kun Gai,
- Abstract summary: Industrial recommendation systems rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items to users.
We introduce RecFlow, an industrial full flow recommendation dataset designed to bridge the gap between offline RS benchmarks and the real online environment.
Our dataset comprises 38M interactions from 42K users across nearly 9M items with additional 1.9B stage samples collected from 9.3M online requests over 37 days and spanning 6 stages.
- Score: 66.06445386541122
- License:
- Abstract: Industrial recommendation systems (RS) rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real world industrial RS, they face a critical challenge of handling unexposed items which are a significantly larger space than the exposed one. This discrepancy profoundly impacts their practical performance. Additionally, these algorithms often overlook the intricate interplay between multiple RS stages, resulting in suboptimal overall system performance. To address this issue, we introduce RecFlow, an industrial full flow recommendation dataset designed to bridge the gap between offline RS benchmarks and the real online environment. Unlike existing datasets, RecFlow includes samples not only from the exposure space but also unexposed items filtered at each stage of the RS funnel. Our dataset comprises 38M interactions from 42K users across nearly 9M items with additional 1.9B stage samples collected from 9.3M online requests over 37 days and spanning 6 stages. Leveraging the RecFlow dataset, we conduct courageous exploration experiments, showcasing its potential in designing new algorithms to enhance effectiveness by incorporating stage-specific samples. Some of these algorithms have already been deployed online, consistently yielding significant gains. We propose RecFlow as the first comprehensive benchmark dataset for the RS community, supporting research on designing algorithms at any stage, study of selection bias, debiased algorithms, multi-stage consistency and optimality, multi-task recommendation, and user behavior modeling. The RecFlow dataset, along with the corresponding source code, is available at https://github.com/RecFlow-ICLR/RecFlow.
Related papers
- RMS-FlowNet++: Efficient and Robust Multi-Scale Scene Flow Estimation for Large-Scale Point Clouds [15.138542932078916]
RMS-FlowNet++ is a novel end-to-end learning-based architecture for accurate and efficient scene flow estimation.
Our architecture provides a faster prediction than state-of-the-art methods, avoids high memory requirements and enables efficient scene flow on dense point clouds of more than 250K points at once.
arXiv Detail & Related papers (2024-07-01T09:51:17Z) - NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking [65.24988062003096]
We present NAVSIM, a framework for benchmarking vision-based driving policies.
Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other.
NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights.
arXiv Detail & Related papers (2024-06-21T17:59:02Z) - Boundary-aware Decoupled Flow Networks for Realistic Extreme Rescaling [49.215957313126324]
Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling.
However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details.
We propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results.
arXiv Detail & Related papers (2024-05-05T14:05:33Z) - OptFlow: Fast Optimization-based Scene Flow Estimation without
Supervision [6.173968909465726]
We present OptFlow, a fast optimization-based scene flow estimation method.
It achieves state-of-the-art performance for scene flow estimation on popular autonomous driving benchmarks.
arXiv Detail & Related papers (2024-01-04T21:47:56Z) - Soft Random Sampling: A Theoretical and Empirical Analysis [59.719035355483875]
Soft random sampling (SRS) is a simple yet effective approach for efficient deep neural networks when dealing with massive data.
It selects a uniformly speed at random with replacement from each data set in each epoch.
It is shown to be a powerful and competitive strategy with significant and competitive performance on real-world industrial scale.
arXiv Detail & Related papers (2023-11-21T17:03:21Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX.
textttMEX integrates estimation and planning components while balancing exploration exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Comparative Study of Coupling and Autoregressive Flows through Robust
Statistical Tests [0.0]
We propose an in-depth comparison of coupling and autoregressive flows, both of the affine and rational quadratic type.
We focus on a set of multimodal target distributions increasing dimensionality ranging from 4 to 400.
Our results indicate that the A-RQS algorithm stands out both in terms of accuracy and training speed.
arXiv Detail & Related papers (2023-02-23T13:34:01Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - EMFlow: Data Imputation in Latent Space via EM and Deep Flow Models [5.076419064097734]
We introduce an imputation approach, called EMFlow, that performs imputation in a latent space via an online version of Expectation-Maximization algorithm.
The proposed EMFlow has superior performance to competing methods in terms of both imputation quality and convergence speed.
arXiv Detail & Related papers (2021-06-09T04:35:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.