Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
- URL: http://arxiv.org/abs/2602.10016v2
- Date: Fri, 13 Feb 2026 19:00:33 GMT
- Title: Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
- Authors: Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Ellie Wen, Jiyan Yang, Huayu Li,
- Abstract summary: We introduce Kunlun, a scalable architecture that improves model efficiency and resource allocation. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
- Score: 39.35320040234209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
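The predictable power-law scaling the abstract refers to is conventionally estimated by fitting performance against compute in log-log space. The paper's own fitting procedure is not given in the abstract; the sketch below shows the generic approach, with synthetic data standing in for real measurements.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) via linear regression in log-log space.

    Returns (a, b). log(loss) = log(a) - b * log(compute) is linear,
    so an ordinary least-squares fit recovers both constants.
    """
    log_c = np.log(compute)
    log_l = np.log(loss)
    slope, log_a = np.polyfit(log_c, log_l, deg=1)
    return float(np.exp(log_a)), float(-slope)

# Synthetic check: points drawn exactly from loss = 2.0 * C**(-0.15).
compute = np.logspace(18, 24, 20)   # training FLOPs per run
loss = 2.0 * compute ** (-0.15)
a, b = fit_power_law(compute, loss)
```

On noiseless synthetic data the fit recovers the generating constants; with real training runs the quality of this fit is exactly the "scaling efficiency" question the paper addresses.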
Related papers
- LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation [9.59487558742976]
We present LLaTTE, a scalable transformer architecture for production ads recommendation. We demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs. We find that semantic features bend the scaling curve, enabling the model to effectively utilize the capacity of deeper and longer architectures.
arXiv Detail & Related papers (2026-01-27T21:59:36Z) - Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment [16.883389041355073]
We propose a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings.
arXiv Detail & Related papers (2025-08-04T22:03:13Z) - MTGR: Industrial-Scale Generative Recommendation Framework in Meituan [32.12374665716164]
We propose MTGR (Meituan Generative Recommendation) to address this issue. MTGR achieves training and inference acceleration through user-level compression to ensure efficient scaling. This breakthrough was successfully deployed on Meituan, the world's largest food delivery platform.
arXiv Detail & Related papers (2025-05-24T11:47:28Z) - Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC [8.837470787975308]
Large Language Models (LLMs) on edge devices offer significant privacy benefits. These on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. We introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs.
arXiv Detail & Related papers (2025-05-21T02:23:01Z) - Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of "memory utilization". This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
arXiv Detail & Related papers (2025-04-28T08:12:30Z) - A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation. Deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency. This survey analyzes optimization techniques for MoE models across the entire system stack.
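The conditional computation described in this survey can be illustrated with a minimal top-k routing sketch. All shapes and names below are illustrative, not taken from the survey; the point is that only k of the experts run per token, so compute scales with k rather than with the total expert count.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Minimal top-k Mixture-of-Experts routing for one token.

    x: (d,) input vector; experts: list of (d, d) weight matrices;
    gate_w: (num_experts, d) router weights.
    """
    logits = gate_w @ x                      # router score per expert
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only the k selected experts are evaluated -- conditional computation.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
gate_w = rng.normal(size=(n_exp, d))
y = moe_forward(rng.normal(size=d), experts, gate_w, k=2)
```

The inference challenges the survey covers stem largely from this routing step: which experts fire varies per token, which complicates batching, memory placement, and latency guarantees.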
arXiv Detail & Related papers (2024-12-18T14:11:15Z) - Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
The Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality. We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
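Approximate Entropy is a standard regularity statistic (due to Pincus): ApEn = Phi(m) - Phi(m+1), where Phi(m) averages the log-frequency with which length-m patterns recur within tolerance r. The sketch below follows the textbook definition and may differ in detail from how this paper applies it to sequential-recommendation data.

```python
import numpy as np

def approximate_entropy(series, m=2, r=0.2):
    """Approximate Entropy (ApEn) of a 1-D sequence.

    Low values indicate a regular, predictable sequence; higher
    values indicate irregularity.
    """
    x = np.asarray(series, dtype=float)
    N = len(x)

    def phi(m):
        # All overlapping windows of length m.
        windows = np.array([x[i:i + m] for i in range(N - m + 1)])
        # Chebyshev distance between every pair of windows.
        dists = np.max(np.abs(windows[:, None, :] - windows[None, :, :]), axis=2)
        # C[i]: fraction of windows within r of window i (self-match included,
        # so C[i] > 0 and the log is always defined).
        C = np.mean(dists <= r, axis=1)
        return np.mean(np.log(C))

    return phi(m) - phi(m + 1)

rng = np.random.default_rng(0)
regular = np.zeros(60)          # perfectly regular signal -> ApEn = 0
noisy = rng.uniform(size=60)    # irregular signal -> ApEn > 0
```

Used as a data-quality signal, a more irregular interaction history (higher ApEn) carries more information per event than a highly repetitive one.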
arXiv Detail & Related papers (2024-11-30T10:56:30Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches [64.42735183056062]
Large language models (LLMs) have evolved from specialized deep models to versatile foundation models. LLMs require fine-tuning on local datasets and substantial memory for deployment over the network edges. LLMs have been expanded beyond text generation to create images, audio, video, and multi-modal content. Model fine-tuning and model-compression techniques have been developed to support the sustainable growth of LLMs.
arXiv Detail & Related papers (2024-08-20T09:42:17Z) - TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
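The central-tensor sharing described above can be illustrated with a plain low-rank analogue. The actual MPO decomposition is a tensor-train factorization; everything below is an illustrative simplification in which each layer's weight is U_l @ C @ V_l with the "central" factor C shared across all layers.

```python
import numpy as np

def build_shared_layers(d, rank, n_layers, rng):
    """Build layers that share one central factor C.

    Each layer keeps only its own small U_l and V_l; the rank x rank
    central factor C is stored once for the whole stack.
    """
    C = rng.normal(size=(rank, rank))            # shared central factor
    layers = [(rng.normal(size=(d, rank)),       # per-layer U_l
               rng.normal(size=(rank, d)))       # per-layer V_l
              for _ in range(n_layers)]
    return C, layers

def layer_weight(C, U, V):
    return U @ C @ V                             # reconstructed d x d weight

rng = np.random.default_rng(0)
d, rank, L = 64, 8, 12
C, layers = build_shared_layers(d, rank, L, rng)

full_params = L * d * d                          # L unshared dense layers
shared_params = rank * rank + L * 2 * d * rank   # one C plus per-layer factors
```

With d=64, rank=8, and 12 layers this stores 12,352 parameters instead of 49,152, showing why sharing the central factor shrinks the model as depth grows.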
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale [11.121380180647769]
We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware.
We also discuss the design and development of a tool chain to maintain our models' accuracy throughout their lifespan.
We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering.
arXiv Detail & Related papers (2021-05-26T16:42:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.