Climber: Toward Efficient Scaling Laws for Large Recommendation Models
- URL: http://arxiv.org/abs/2502.09888v2
- Date: Thu, 28 Aug 2025 01:40:30 GMT
- Title: Climber: Toward Efficient Scaling Laws for Large Recommendation Models
- Authors: Songpei Xu, Shijia Wang, Da Guo, Xianwen Guo, Qiang Xiao, Bin Huang, Guanlin Wu, Chuanjiang Luo,
- Abstract summary: We propose Climber, an efficient recommendation framework comprising two synergistic components. Our proposed model adopts two core innovations: (1) multi-scale sequence extraction that achieves a time complexity reduction by a constant factor, enabling more efficient scaling with sequence length; (2) dynamic temperature modulation adapting attention distributions to the multi-scenario and multi-behavior patterns. Climber has been successfully deployed on Netease Cloud Music, one of China's largest music streaming platforms, serving tens of millions of users daily.
- Score: 8.970144942471425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based generative models have achieved remarkable success across domains with various scaling law manifestations. However, our extensive experiments reveal persistent challenges when applying Transformer to recommendation systems: (1) Transformer scaling is not ideal with increased computational resources, due to structural incompatibilities with recommendation-specific features such as multi-source data heterogeneity; (2) critical online inference latency constraints (tens of milliseconds) that intensify with longer user behavior sequences and growing computational demands. We propose Climber, an efficient recommendation framework comprising two synergistic components: the model architecture for efficient scaling and the co-designed acceleration techniques. Our proposed model adopts two core innovations: (1) multi-scale sequence extraction that achieves a time complexity reduction by a constant factor, enabling more efficient scaling with sequence length; (2) dynamic temperature modulation adapting attention distributions to the multi-scenario and multi-behavior patterns. Complemented by acceleration techniques, Climber achieves a 5.15$\times$ throughput gain without performance degradation by adopting a "single user, multiple item" batched processing and memory-efficient Key-Value caching. Comprehensive offline experiments on multiple datasets validate that Climber exhibits a more ideal scaling curve. To our knowledge, this is the first publicly documented framework where controlled model scaling drives continuous online metric growth (12.19\% overall lift) without prohibitive resource costs. Climber has been successfully deployed on Netease Cloud Music, one of China's largest music streaming platforms, serving tens of millions of users daily.
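To make the abstract's mechanisms more concrete, below is a minimal, hypothetical PyTorch sketch of (a) temperature-modulated attention over a user behavior sequence and (b) "single user, multiple item" scoring, where the user's keys and values are computed once and reused across all candidate items. The function names, shapes, and the dot-product scorer are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' released code) of two ideas named in the abstract:
# (a) temperature-modulated attention: a per-request scalar rescales the softmax logits,
#     flattening (>1) or sharpening (<1) the attention distribution per scenario/behavior;
# (b) "single user, multiple item" scoring: the user's behavior sequence is encoded once,
#     and its key/value tensors are shared across a batch of candidate items.
import torch
import torch.nn.functional as F

def temperature_modulated_attention(q, k, v, temperature):
    """Scaled dot-product attention with an extra temperature scalar.

    q: (items, d)   one query per candidate item
    k, v: (seq, d)  keys/values built once from the user's behavior sequence
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / (d ** 0.5 * temperature)
    weights = F.softmax(logits, dim=-1)
    return weights @ v

def score_candidates(user_seq, candidates, w_q, w_k, w_v, temperature=1.0):
    """Encode the user sequence once, then score many candidate items against it."""
    k = user_seq @ w_k            # computed once per user (KV-cache analogue)
    v = user_seq @ w_v
    q = candidates @ w_q          # one query per candidate item
    ctx = temperature_modulated_attention(q, k, v, temperature)
    return (ctx * q).sum(-1)      # toy dot-product scorer, one score per item

# Toy usage: 200 behaviors of width 64, 500 candidate items, a "flatter" temperature.
seq, items, d = torch.randn(200, 64), torch.randn(500, 64), 64
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
scores = score_candidates(seq, items, w_q, w_k, w_v, temperature=1.5)
print(scores.shape)  # torch.Size([500])
```

In this sketch the per-item cost of re-encoding the user sequence is avoided by sharing k and v across candidates, which is the intuition behind the "single user, multiple item" batching described above.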
Related papers
- MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders [11.566232697512879]
MixFormer is a unified Transformer-style architecture tailored for recommender systems.
It jointly models sequential behaviors and feature interactions within a single backbone.
Experiments on large-scale industrial datasets demonstrate that MixFormer consistently exhibits superior accuracy and efficiency.
arXiv Detail & Related papers (2026-02-15T11:53:30Z) - GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture users' sequences from long-term history.
Generative Multi-streamers (GEMs) break user sequences into three streams.
Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z) - Compress, Cross and Scale: Multi-Level Compression Cross Networks for Efficient Scaling in Recommender Systems [5.897678894426804]
MLCC is a structured feature interaction architecture that organizes feature crosses through hierarchical compression and dynamic composition.
MC-MLCC is a Multi-Channel extension that decomposes feature interactions into parallel subspaces.
Our proposed models consistently outperform strong DLRM-style baselines by up to 0.52 AUC, while reducing model parameters and FLOPs by up to 26$\times$ under comparable performance.
arXiv Detail & Related papers (2026-02-12T15:06:46Z) - LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation [9.59487558742976]
We present LLaTTE, a scalable transformer architecture for production ads recommendation.
We demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs.
We find that semantic features bend the scaling curve, enabling the model to effectively utilize the capacity of deeper and longer architectures.
arXiv Detail & Related papers (2026-01-27T21:59:36Z) - MTGR: Industrial-Scale Generative Recommendation Framework in Meituan [28.92150571719811]
We propose MTGR (Meituan Generative Recommendation) to address this issue.
MTGR achieves training and inference acceleration through user-level compression to ensure efficient scaling.
This breakthrough was successfully deployed on Meituan, the world's largest food delivery platform.
arXiv Detail & Related papers (2025-05-24T11:47:28Z) - SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models [21.933379266533098]
Large Language Models (LLMs) present a critical trade-off between inference quality and computational cost.
Existing serving strategies often employ fixed model scales or static two-stage speculative decoding.
This paper introduces SpecRouter, a novel framework that reimagines LLM inference as an adaptive routing problem.
arXiv Detail & Related papers (2025-05-12T15:46:28Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants.
We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance (a minimal sketch of this gating appears after the list below).
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders [23.70714095931094]
LONGER, a Long-sequence optimized traNsformer for GPU-Efficient Recommenders, consistently outperforms strong baselines in offline metrics and online A/B testing.
arXiv Detail & Related papers (2025-05-07T13:54:26Z) - Towards Scalable and Deep Graph Neural Networks via Noise Masking [59.058558158296265]
Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks.
Scaling them to large graphs is challenging due to the high computational and storage costs.
We present random walk with noise masking (RMask), a plug-and-play module compatible with the existing model-simplification works.
arXiv Detail & Related papers (2024-12-19T07:48:14Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization [18.00873866263434]
Fine-tuning large language models (LLMs) poses significant memory challenges.
Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method.
We introduce HELENE, a novel scalable and memory-efficient pre-conditioner.
arXiv Detail & Related papers (2024-11-16T04:27:22Z) - SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism [1.7170348600689374]
We propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism.
We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks.
Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
arXiv Detail & Related papers (2024-11-15T19:29:51Z) - Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain.
Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone.
arXiv Detail & Related papers (2024-10-16T02:06:27Z) - Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule [50.260693393896716]
Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images.
Recent techniques have been employed to automatically search for faster generation processes.
We introduce Flexiffusion, a novel training-free NAS paradigm designed to accelerate diffusion models.
arXiv Detail & Related papers (2024-09-26T06:28:05Z) - AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.
Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph.
We propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency [10.641875933652647]
We introduce multi-granularity architecture search (MGAS) to discover both effective and efficient neural networks.
We learn discretization functions specific to each granularity level to adaptively determine the unit remaining ratio according to the evolving architecture.
Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.
arXiv Detail & Related papers (2023-10-23T16:32:18Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer [34.790081960470964]
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM).
We make advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.
We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus.
arXiv Detail & Related papers (2023-07-27T16:45:33Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4$\sim$8 times compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
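As referenced in the gated-attention entry above, here is a minimal sketch of that mechanism, assuming the gate is an elementwise, head-specific sigmoid computed from the layer input and multiplied onto the SDPA output. Module and parameter names are illustrative and not taken from that paper.

```python
# Hypothetical sketch of gated multi-head attention: a sigmoid gate, projected per head
# from the layer input, rescales the output of scaled dot-product attention (SDPA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)   # head-specific gate logits
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(z):                              # (batch, heads, seq, d_head)
            return z.view(b, t, self.h, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        # sigmoid gate applied elementwise, per head, to the SDPA output
        g = torch.sigmoid(self.gate(x)).view(b, t, self.h, self.d_head).transpose(1, 2)
        y = (g * attn).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)

y = GatedMultiHeadAttention(64, 4)(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```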
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.