Multi-Epoch Learning for Deep Click-Through Rate Prediction Models
- URL: http://arxiv.org/abs/2305.19531v1
- Date: Wed, 31 May 2023 03:36:50 GMT
- Title: Multi-Epoch Learning for Deep Click-Through Rate Prediction Models
- Authors: Zhaocheng Liu, Zhongxiang Fan, Jian Liang, Dongying Kong, Han Li
- Abstract summary: The one-epoch overfitting phenomenon has been widely observed in industrial Click-Through Rate (CTR) applications.
We propose Multi-Epoch learning with Data Augmentation (MEDA), a novel training paradigm that can be directly applied to most deep CTR models.
- Score: 32.80864867251999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The one-epoch overfitting phenomenon has been widely observed in industrial
Click-Through Rate (CTR) applications, where the model performance experiences
a significant degradation at the beginning of the second epoch. Recent work
has tried to understand the underlying factors behind this phenomenon through
extensive experiments. However, it is still unknown whether a multi-epoch
training paradigm could achieve better results, as the best performance is
usually achieved by one-epoch training. In this paper, we hypothesize that the
emergence of this phenomenon may be attributed to the susceptibility of the
embedding layer to overfitting, which can stem from the high-dimensional
sparsity of data. To maintain feature sparsity while simultaneously avoiding
overfitting of embeddings, we propose Multi-Epoch learning with Data
Augmentation (MEDA), a novel training paradigm that can be directly applied to
most deep CTR models.
MEDA achieves data augmentation by reinitializing the embedding layer in each
epoch, thereby avoiding embedding overfitting and simultaneously improving
convergence. To the best of our knowledge, MEDA is the first multi-epoch training
paradigm designed for deep CTR prediction models. We conduct extensive
experiments on several public datasets, which verify the effectiveness of the
proposed MEDA. Notably, the results show that MEDA can significantly
outperform conventional one-epoch training. Moreover, MEDA has also exhibited
significant benefits in a real-world scenario at Kuaishou.
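The core mechanism described in the abstract, reinitializing the embedding layer at the start of each epoch while the dense layers keep training, is straightforward to prototype. Below is a minimal sketch in PyTorch; the toy model, pooling, initialization scale, and optimizer handling are illustrative assumptions of this sketch, not the authors' implementation.

```python
# Minimal sketch of MEDA-style multi-epoch training (hypothetical, not the paper's code):
# the embedding table is re-drawn at the start of every epoch, while the MLP keeps
# its weights, so only the dense layers accumulate signal across epochs.
import torch
import torch.nn as nn


class TinyCTRModel(nn.Module):
    def __init__(self, num_features: int, embed_dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(num_features, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, feature_ids: torch.Tensor) -> torch.Tensor:
        # feature_ids: (batch, num_fields) integer IDs; mean-pool field embeddings.
        pooled = self.embedding(feature_ids).mean(dim=1)
        return self.mlp(pooled).squeeze(-1)


def train_meda(model: TinyCTRModel, data_loader, num_epochs: int = 3, lr: float = 1e-3):
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(num_epochs):
        if epoch > 0:
            # MEDA step: reinitialize the embedding table so sparse feature IDs
            # cannot keep being memorized across epochs; from the MLP's point of
            # view the re-embedded inputs act as augmented data.
            nn.init.normal_(model.embedding.weight, std=0.01)
        # Rebuilding the optimizer drops momentum tied to the discarded embeddings
        # (a choice of this sketch, not something specified by the abstract).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for feature_ids, labels in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(feature_ids), labels.float())
            loss.backward()
            optimizer.step()
```

In this reading, the repeatedly re-embedded inputs look like augmented data to the downstream layers, which is presumably what allows multi-epoch training to keep improving convergence without embedding overfitting.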
Related papers
- Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z) - Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z) - Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps rather than the instantaneous input-output relationships of earlier settings.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on the timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
arXiv Detail & Related papers (2024-01-17T07:58:18Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation and find that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval [80.54532535622988]
We show that a generalizable dense retriever can be trained to achieve high accuracy in both supervised and zero-shot retrieval.
DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations.
arXiv Detail & Related papers (2023-02-15T03:53:26Z) - Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Prediction Models [16.984947259260878]
We observe an interesting one-epoch overfitting problem in Click-Through Rate (CTR) prediction.
The model performance exhibits a dramatic degradation at the beginning of the second epoch.
Thereby, the best performance is usually achieved by training with only one epoch.
arXiv Detail & Related papers (2022-09-04T11:36:16Z) - CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation [30.56389761245621]
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models.
Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training.
We propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA.
arXiv Detail & Related papers (2022-04-15T23:16:37Z) - When and how epochwise double descent happens [7.512375012141203]
An 'epochwise double descent' effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time.
This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization.
We show that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective.
arXiv Detail & Related papers (2021-08-26T19:19:17Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.