On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective
- URL: http://arxiv.org/abs/2505.11318v1
- Date: Fri, 16 May 2025 14:41:57 GMT
- Title: On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective
- Authors: Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Danai Koutra
- Abstract summary: Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables. We argue that one core component of these pipelines is heavily overlooked: weight decay. We propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes) to simplify the training of high-performing CF models.
- Score: 38.87580457343038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables. However, as embedding tables grow, closed-form solutions become impractical, often necessitating the use of mini-batch gradient descent for training. Despite extensive work on designing loss functions to train CF models, we argue that one core component of these pipelines is heavily overlooked: weight decay. Attaining high-performing models typically requires careful tuning of weight decay, regardless of loss, yet its necessity is not well understood. In this work, we question why weight decay is crucial in CF pipelines and how it impacts training. Through theoretical and empirical analysis, we surprisingly uncover that weight decay's primary function is to encode popularity information into the magnitudes of the embedding vectors. Moreover, we find that tuning weight decay acts as a coarse, non-linear knob to influence preference towards popular or unpopular items. Based on these findings, we propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes), a straightforward yet effective solution to simplify the training of high-performing CF models. PRISM pre-encodes the popularity information typically learned through weight decay, eliminating its necessity. Our experiments show that PRISM improves performance by up to 4.77% and reduces training times by 38.48%, compared to state-of-the-art training strategies. Additionally, we parameterize PRISM to modulate the initialization strength, offering a cost-effective and meaningful strategy to mitigate popularity bias.
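The abstract describes PRISM only at a high level, so the following is a minimal sketch of what a popularity-aware magnitude initialization could look like. The log1p scaling, the `scale` parameter, and the `popularity_aware_init` helper are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def popularity_aware_init(num_items, dim, interaction_counts, scale=1.0, seed=0):
    """Sketch of a PRISM-style initialization: embedding directions are random,
    while embedding magnitudes pre-encode popularity (here via log1p of the
    interaction counts), mimicking the magnitude structure that weight decay
    would otherwise have to induce during training."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(num_items, dim))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm directions
    counts = np.asarray(interaction_counts, dtype=float)
    magnitudes = scale * np.log1p(counts)                # popularity -> magnitude (assumed form)
    return emb * magnitudes[:, None]

# Example: five items with very different popularity levels.
item_emb = popularity_aware_init(5, 8, interaction_counts=[1000, 200, 50, 5, 1])
print(np.linalg.norm(item_emb, axis=1))  # norms increase with popularity
```

In this sketch the embedding table would then be trained without weight decay; `scale` is the kind of knob the abstract describes for modulating initialization strength as a lever on popularity bias.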
Related papers
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization [15.8696301825572]
Continuously-trained deep neural networks (DNNs) must rapidly learn new concepts while preserving and utilizing prior knowledge. Weights for newly encountered categories are typically initialized randomly, leading to high initial training loss (spikes) and instability. Inspired by Neural Collapse (NC), we propose a weight initialization strategy to improve learning efficiency in CL.
arXiv Detail & Related papers (2025-03-09T01:44:22Z) - Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models [27.847140934456288]
This paper proposes a new weight decay technique, Selective Projection Decay (SPD)
SPD selectively imposes a strong penalty on certain layers while allowing others to change freely.
When equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness on benchmarks.
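The summary above says only that SPD penalizes some layers strongly while leaving others free; the sketch below illustrates that layer-selective idea in generic form. The `penalize` selection rule, the decay strength `lam`, and the choice of penalizing deviation from the pre-trained weights are all assumptions here, not the paper's actual criterion.

```python
import torch

def selective_decay_step(model, pretrained, penalize, lam=1e-2):
    """Generic layer-selective decay: after task gradients are computed, add a
    decay term only for layers flagged by `penalize`, pulling them toward their
    pre-trained values (the reference point is an assumption; decay toward zero
    would use p.data alone). Unselected layers change freely."""
    for (name, p), p0 in zip(model.named_parameters(), pretrained):
        if p.grad is not None and penalize(name):
            p.grad.add_(lam * (p.data - p0))  # strong penalty on selected layers

# Toy usage with a name-based selection rule (illustrative only).
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
pretrained = [p.detach().clone() for p in model.parameters()]
loss = model(torch.randn(8, 4)).sum()
loss.backward()
selective_decay_step(model, pretrained, penalize=lambda n: n.startswith("0."))
```

In a real fine-tuning loop this would sit between `loss.backward()` and `optimizer.step()`; the point is only to show penalized versus freely-changing layers, not SPD's actual selection criterion.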
arXiv Detail & Related papers (2024-11-03T23:36:53Z) - CLASSP: a Biologically-Inspired Approach to Continual Learning through Adjustment Suppression and Sparsity Promotion [0.0]
This paper introduces a new training method named Continual Learning through Adjustment Suppression and Sparsity Promotion (CLASSP)
CLASSP is based on two main principles observed in neuroscience, particularly in the context of synaptic transmission and Long-Term Potentiation.
When compared with Elastic Weight Consolidation (EWC), CLASSP demonstrates superior performance in terms of accuracy and memory footprint.
arXiv Detail & Related papers (2024-04-29T13:31:00Z) - Why Do We Need Weight Decay in Modern Deep Learning? [24.81634291051533]
Weight decay is a broadly used technique for training state-of-the-art deep networks, from image classification to large language models.
In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory.
For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics, enhancing the implicit regularization of SGD.
arXiv Detail & Related papers (2023-10-06T17:58:21Z) - FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choices of weight decay and identify that the weight decay value appreciably influences the convergence of existing FL algorithms.
We develop Federated optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithms.
arXiv Detail & Related papers (2023-10-04T21:11:40Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models (a minimal sketch of the reparameterisation follows below).
Models trained in this manner exhibit similar performance to conventionally trained ones, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
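The summary does not state the reparameterisation itself; the sketch below assumes the power-law form w = φ·|φ|^(α−1) associated with Powerpropagation, so treat the exact expression and the example values as assumptions rather than a definitive restatement of the method.

```python
import numpy as np

def powerprop_forward(phi, alpha=2.0):
    """Effective weight under the (assumed) power reparameterisation:
    w = phi * |phi|**(alpha - 1). For alpha > 1, small parameters are
    pushed toward zero while large ones stay comparatively intact."""
    return phi * np.abs(phi) ** (alpha - 1)

def powerprop_grad_phi(phi, grad_w, alpha=2.0):
    """Chain rule through the reparameterisation: dw/dphi = alpha * |phi|**(alpha - 1),
    so small-magnitude parameters also receive proportionally smaller updates."""
    return grad_w * alpha * np.abs(phi) ** (alpha - 1)

phi = np.array([1.5, 0.1, -0.01, 0.8])
print(powerprop_forward(phi))               # small entries collapse toward zero
print(powerprop_grad_phi(phi, np.ones(4)))  # and are updated more slowly
```

Because the gradient through the reparameterisation is scaled by α|φ|^(α−1), already-small parameters change slowly, which is consistent with the "markedly higher density at zero" noted in the summary.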
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - FixNorm: Dissecting Weight Decay for Training Deep Neural Networks [7.820667552233989]
We propose a new training method called FixNorm, which discards weight decay and instead directly controls the two underlying mechanisms identified in the paper.
On the ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7% top-1 accuracy, which outperforms the original baseline by a clear margin.
arXiv Detail & Related papers (2021-03-29T05:41:56Z) - On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective [90.39123717733334]
We present the first practical scheduler for weight decay, called the Scheduled Weight Decay (SWD) method.
Our experiments also support that SWD indeed mitigates large gradient norms and often significantly outperforms the conventional constant weight decay strategy for Adaptive Moment Estimation (Adam).
arXiv Detail & Related papers (2020-11-23T00:39:49Z) - PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation [58.98802062945709]
We propose a novel Predicate-Correlation Perception Learning scheme to adaptively seek out appropriate loss weights.
Our PCPL framework is further equipped with a graph encoder module to better extract context features.
arXiv Detail & Related papers (2020-09-02T08:30:09Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)