Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction
- URL: http://arxiv.org/abs/2107.14432v6
- Date: Thu, 05 Dec 2024 08:11:50 GMT
- Title: Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction
- Authors: Yun Yue, Yongchao Liu, Suo Tong, Minghao Li, Zhen Zhang, Chunyang Wen, Huanjun Bao, Lihong Gu, Jinjie Gu, Yixiang Mu,
- Abstract summary: We develop a novel framework that adds regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning.
We establish theoretically proven convergence guarantees in stochastic convex settings.
Our methods can achieve extremely high sparsity with significantly better or highly competitive performance.
- Score: 19.08180531016811
- License:
- Abstract: We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad and AdaHessian, creating a new class of optimizers named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group AdaHessian, etc., accordingly. We establish theoretically proven convergence guarantees in stochastic convex settings, based on primal-dual methods. We evaluate the regularization effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that, compared with the original optimizers followed by a magnitude-pruning post-processing step, our methods significantly improve model performance at the same sparsity level. Furthermore, compared with the cases without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance. The code is available at https://github.com/intelligent-machine-learning/tfplus/tree/main/tfplus.
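The sparsity mechanism can be illustrated with the proximal map of the sparse group lasso penalty. The NumPy sketch below is illustrative only (function and parameter names are assumptions, and the paper folds the regularizer into closed-form primal-dual updates inside each adaptive optimizer rather than applying a plain proximal step); it shows why the group term zeroes out entire coordinate groups, such as whole embedding rows in CTR models:

```python
import numpy as np

def sparse_group_lasso_prox(w, lr, lam1, lam2, groups):
    """Proximal step for the sparse group lasso penalty
    lam1 * ||w||_1 + lam2 * sum_g sqrt(d_g) * ||w_g||_2,
    applied after a gradient step with learning rate lr."""
    # Elementwise soft-thresholding handles the l1 part.
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)
    # Groupwise shrinkage handles the l2-of-groups part and can
    # zero out whole groups at once (e.g. an entire embedding row).
    for g in groups:  # g is an index array selecting one group
        norm = np.linalg.norm(w[g])
        thresh = np.sqrt(len(g)) * lr * lam2
        w[g] = 0.0 if norm <= thresh else w[g] * (1.0 - thresh / norm)
    return w

# Example: two groups of three weights; the small group is zeroed entirely.
w = np.array([0.9, -0.8, 0.7, 0.05, -0.03, 0.02])
print(sparse_group_lasso_prox(w, lr=0.1, lam1=0.1, lam2=0.5,
                              groups=[np.arange(0, 3), np.arange(3, 6)]))
```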
Related papers
- Scaling LLM Inference with Optimized Sample Compute Allocation [56.524278187351925]
We propose OSCA, an algorithm to find an optimal mix of different inference configurations.
Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration.
OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration.
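A minimal sketch of the underlying allocation problem, assuming per-configuration success probabilities and costs are known; the brute-force search and all names here are illustrative assumptions, since OSCA learns the allocation rather than enumerating it:

```python
import itertools
import numpy as np

def best_allocation(success_probs, costs, budget):
    """Find the sample allocation across inference configurations that
    maximizes P(at least one sampled answer is correct) under a budget."""
    n_cfg = len(success_probs)
    max_n = int(budget / min(costs))
    best, best_p = None, -1.0
    for alloc in itertools.product(range(max_n + 1), repeat=n_cfg):
        if np.dot(alloc, costs) > budget:
            continue
        # P(no success) = prod over configs of (1 - p_i)^(n_i samples)
        p = 1.0 - np.prod([(1 - pi) ** n for pi, n in zip(success_probs, alloc)])
        if p > best_p:
            best, best_p = alloc, p
    return best, best_p

# Example: a cheap weak configuration vs. an expensive strong one.
print(best_allocation(success_probs=[0.3, 0.5], costs=[1.0, 3.0], budget=6.0))
```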
arXiv Detail & Related papers (2024-10-29T19:17:55Z)
- Self-DenseMobileNet: A Robust Framework for Lung Nodule Classification using Self-ONN and Stacking-based Meta-Classifier [1.2300841481611335]
Self-DenseMobileNet is designed to enhance the classification of nodules and non-nodules in chest radiographs (CXRs).
Our framework integrates advanced image standardization and enhancement techniques to optimize the input quality.
When tested on an external dataset, the framework maintained strong generalizability with an accuracy of 89.40%.
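A hedged illustration of the stacking-based meta-classifier idea in scikit-learn; the base estimators here are generic stand-ins, whereas the paper stacks predictions from its Self-ONN-based DenseMobileNet variants:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base models' out-of-fold predictions become features for a meta-learner.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the meta-classifier
    cv=5,
)
print(stack.fit(X, y).score(X, y))
```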
arXiv Detail & Related papers (2024-10-16T14:04:06Z)
- Edge-Efficient Deep Learning Models for Automatic Modulation Classification: A Performance Analysis [0.7428236410246183]
We investigate optimized convolutional neural networks (CNNs) developed for automatic modulation classification (AMC) of wireless signals.
We propose optimized models that combine these techniques to fuse their complementary optimization benefits.
The experimental results show that the proposed individual and combined optimization techniques are highly effective for developing models with significantly less complexity.
arXiv Detail & Related papers (2024-04-11T06:08:23Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
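A rough sketch of two memory-saving ingredients in this line of work, factored second-moment statistics and an in-place update applied as soon as a tensor's gradient is available; this is an assumption-laden illustration, not AdaLomo's exact algorithm:

```python
import numpy as np

def low_memory_update(w, g, row_acc, col_acc, lr=1e-3, beta2=0.99, eps=1e-8):
    """One adaptive step for a 2-D weight matrix that stores only row and
    column second-moment accumulators (an Adafactor-style factorization)
    instead of a full per-element second moment."""
    g2 = g * g
    row_acc[:] = beta2 * row_acc + (1 - beta2) * g2.mean(axis=1)  # (rows,)
    col_acc[:] = beta2 * col_acc + (1 - beta2) * g2.mean(axis=0)  # (cols,)
    # Rank-1 reconstruction of the per-element second moment.
    v = np.outer(row_acc, col_acc) / max(row_acc.mean(), eps)
    update = g / (np.sqrt(v) + eps)
    # Per-tensor RMS clipping keeps the adaptive step well scaled.
    rms = np.linalg.norm(update) / np.sqrt(update.size)
    w -= lr * update / max(1.0, rms)  # applied in place, no full-state Adam
```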
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers [109.52244418498974]
We propose a novel Admeta (A Double exponential Moving averagE To Adaptive and non-adaptive momentum) framework.
We provide two implementations, AdmetaR and AdmetaS, the former based on RAdam and the latter based on SGDM.
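The double exponential moving average at the core of the framework can be sketched as follows (an SGDM-style variant in the spirit of AdmetaS; names and the exact update rule are illustrative, not the paper's):

```python
import numpy as np

def dema_momentum_step(w, g, m, mm, lr=1e-3, beta=0.9):
    """One SGDM-style step with a double EMA (DEMA) replacing plain
    momentum. DEMA = 2*EMA - EMA(EMA) reduces the lag of a single EMA."""
    m[:] = beta * m + (1 - beta) * g    # first EMA of the gradients
    mm[:] = beta * mm + (1 - beta) * m  # EMA of that EMA
    w -= lr * (2 * m - mm)              # classic DEMA estimator as the step
```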
arXiv Detail & Related papers (2023-07-02T18:16:06Z)
- VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
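The data flow of a learned optimizer can be sketched with a tiny per-parameter MLP; in VeLO the analogous network is meta-trained across many tasks, whereas the weights below are random placeholders just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 2-layer MLP that maps per-parameter features to updates.
W1, b1 = rng.normal(0.0, 0.1, (3, 16)), np.zeros(16)
W2, b2 = rng.normal(0.0, 0.1, (16, 1)), np.zeros(1)

def learned_update(w, g, m):
    """Per-parameter features (gradient, momentum, weight) in, update out."""
    feats = np.stack([g, m, w], axis=-1)  # (n_params, 3)
    h = np.tanh(feats @ W1 + b1)          # (n_params, 16)
    return (h @ W2 + b2).squeeze(-1)      # (n_params,) proposed updates

w = rng.normal(size=10)
print(w - learned_update(w, g=rng.normal(size=10), m=np.zeros(10)))
```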
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
- GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector [156.43671738038657]
We present a novel end-to-end group collaborative learning network, termed GCoNet+.
GCoNet+ can effectively and efficiently identify co-salient objects in natural scenes.
arXiv Detail & Related papers (2022-05-30T23:49:19Z)
- Adaptive Optimization with Examplewise Gradients [23.504973357538418]
We propose a new, more general approach to the design of gradient-based optimization methods for machine learning.
In this new framework, iterations assume access to a batch of estimates per parameter, rather than a single estimate.
This better reflects the information that is actually available in typical machine learning setups.
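A minimal sketch of what access to examplewise gradients enables, using linear regression for brevity; the variance-adapted step is one illustrative choice, not the paper's prescribed method:

```python
import numpy as np

def examplewise_step(w, X, y, lr=0.1, eps=1e-8):
    """Keep each example's gradient estimate instead of only their mean;
    per-coordinate spread is then usable by the optimizer."""
    per_example = 2.0 * (X @ w - y)[:, None] * X  # (batch, dim) gradients
    mean = per_example.mean(axis=0)
    std = per_example.std(axis=0)
    return w - lr * mean / (std + eps)  # shrink steps on noisy coordinates
```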
arXiv Detail & Related papers (2021-11-30T23:37:01Z)
- Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
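The analytic tractability comes from the Gaussian overlap integral. A sketch for 1-D GMMs (the paper applies the divergence inside a VAE objective; scalar means and variances are an assumption made here for brevity):

```python
import numpy as np

def gauss_overlap(m1, v1, m2, v2):
    """Closed form: integral of N(x;m1,v1)*N(x;m2,v2) dx = N(m1; m2, v1+v2)."""
    v = v1 + v2
    return np.exp(-0.5 * (m1 - m2) ** 2 / v) / np.sqrt(2 * np.pi * v)

def cs_divergence_gmm(w1, m1, v1, w2, m2, v2):
    """Cauchy-Schwarz divergence D_CS(p,q) = -log(<p,q>/sqrt(<p,p><q,q>)),
    where each inner product is a double sum of Gaussian overlap terms."""
    def inner(wa, ma, va, wb, mb, vb):
        return sum(wa[i] * wb[j] * gauss_overlap(ma[i], va[i], mb[j], vb[j])
                   for i in range(len(wa)) for j in range(len(wb)))
    pq = inner(w1, m1, v1, w2, m2, v2)
    pp = inner(w1, m1, v1, w1, m1, v1)
    qq = inner(w2, m2, v2, w2, m2, v2)
    return -np.log(pq / np.sqrt(pp * qq))

# Identical mixtures have zero divergence.
print(cs_divergence_gmm([1.0], [0.0], [1.0], [1.0], [0.0], [1.0]))
```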
arXiv Detail & Related papers (2021-01-06T17:36:26Z)
- An Efficient Framework for Clustered Federated Learning [26.24231986590374]
We address the problem of federated learning (FL) where users are distributed into clusters.
We propose the Iterative Federated Clustering Algorithm (IFCA).
We show that our algorithm is efficient in non-convex problems such as neural networks.
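One IFCA-style round can be sketched as alternating cluster assignment and per-cluster model updates; linear regression stands in for the users' local models, and all names here are illustrative:

```python
import numpy as np

def ifca_round(cluster_models, user_data, lr=0.1):
    """One round: each user picks the cluster model with the lowest local
    loss, then its gradient is averaged into that cluster's update."""
    k = len(cluster_models)
    grads = [np.zeros_like(m) for m in cluster_models]
    counts = [0] * k
    for X, y in user_data:  # each user's local dataset
        losses = [np.mean((X @ m - y) ** 2) for m in cluster_models]
        j = int(np.argmin(losses))  # the user's current cluster estimate
        grads[j] += 2.0 * X.T @ (X @ cluster_models[j] - y) / len(y)
        counts[j] += 1
    for j in range(k):
        if counts[j]:
            cluster_models[j] -= lr * grads[j] / counts[j]
```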
arXiv Detail & Related papers (2020-06-07T08:48:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.