Scaling Recommender Transformers to One Billion Parameters
- URL: http://arxiv.org/abs/2507.15994v1
- Date: Mon, 21 Jul 2025 18:30:43 GMT
- Title: Scaling Recommender Transformers to One Billion Parameters
- Authors: Kirill Khrylchenko, Artem Matveev, Sergei Makeev, Vladimir Baikalov,
- Abstract summary: We present a recipe for training large transformer recommenders with up to a billion parameters. We show that autoregressive learning on user histories naturally decomposes into two subtasks, feedback prediction and next-item prediction. We report a successful deployment of our proposed architecture on a large-scale music platform serving millions of users.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large transformer models have been successfully used in many real-world applications such as natural language processing, computer vision, and speech processing, scaling transformers for recommender systems remains a challenging problem. Recently, the Generative Recommenders framework was proposed to scale beyond typical Deep Learning Recommendation Models (DLRMs). Reformulating recommendation as a sequential transduction task improved scaling properties with respect to compute. Nevertheless, the largest encoder configuration reported by the HSTU authors amounts to only ~176 million parameters, which is considerably smaller than the hundreds of billions or even trillions of parameters common in modern language models. In this work, we present a recipe for training large transformer recommenders with up to a billion parameters. We show that autoregressive learning on user histories naturally decomposes into two subtasks, feedback prediction and next-item prediction, and demonstrate that such a decomposition scales effectively across a wide range of transformer sizes. Furthermore, we report a successful deployment of our proposed architecture on a large-scale music platform serving millions of users. According to our online A/B tests, this new model increases total listening time by +2.26% and raises the likelihood of user likes by +6.37%, constituting (to our knowledge) the largest improvement in recommendation quality reported for any deep learning-based system in the platform's history.
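For illustration only, here is a minimal sketch of how the two-subtask decomposition described in the abstract could look in code: a causal transformer over the user's item/feedback history with one head for feedback prediction and another for next-item prediction. All names, dimensions, and the weight-tying choice are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of autoregressive training on user
# histories split into two heads: feedback prediction and next-item prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTaskRecommender(nn.Module):
    def __init__(self, num_items, num_feedback=2, d_model=512,
                 n_layers=8, n_heads=8):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_model)
        self.feedback_emb = nn.Embedding(num_feedback, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.feedback_head = nn.Linear(d_model, num_feedback)            # subtask 1
        self.next_item_head = nn.Linear(d_model, num_items, bias=False)  # subtask 2
        self.next_item_head.weight = self.item_emb.weight                # tie with item table

    def forward(self, items, feedback):
        # items, feedback: (batch, seq_len) integer histories
        h = self.item_emb(items) + self.feedback_emb(feedback)
        seq_len = items.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=items.device), diagonal=1)
        h = self.encoder(h, mask=causal)
        return self.feedback_head(h), self.next_item_head(h)

def two_task_loss(model, items, feedback):
    # Simplification: both heads predict step t+1 from the prefix up to step t.
    fb_logits, item_logits = model(items, feedback)
    fb_loss = F.cross_entropy(fb_logits[:, :-1].flatten(0, 1),
                              feedback[:, 1:].flatten())
    item_loss = F.cross_entropy(item_logits[:, :-1].flatten(0, 1),
                                items[:, 1:].flatten())
    return fb_loss + item_loss
```

In practice the item vocabulary is far too large for a full softmax, so production systems typically replace the plain cross-entropy over items with sampled-softmax or retrieval-style losses; the full softmax above is only for readability.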
Related papers
- A Novel Mamba-based Sequential Recommendation Method [4.941272356564765]
Sequential recommendation (SR) encodes user activity to predict the next action. Transformer-based models have proven effective for sequential recommendation, but the complexity of the self-attention module in Transformers scales quadratically with the sequence length. We propose a novel multi-head latent Mamba architecture, which employs multiple low-dimensional Mamba layers and fully connected layers.
arXiv Detail & Related papers (2025-04-10T02:43:19Z)
- Scaling Sequential Recommendation Models with Transformers [0.0]
We take inspiration from the scaling laws observed in training large language models, and explore similar principles for sequential recommendation. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains.
arXiv Detail & Related papers (2024-12-10T15:20:56Z)
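As a purely illustrative sketch of the kind of compute-performance analysis such scaling-law studies rely on (not code or numbers from the paper), one can fit a power law to (compute, loss) measurements and extrapolate:

```python
# Hypothetical sketch: fit loss ≈ a * C^(-b) to (training compute, validation
# loss) points in log-log space. The data points below are invented.
import numpy as np

compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # training FLOPs
loss = np.array([2.31, 2.18, 2.07, 1.99, 1.93])      # validation loss

# Linear fit in log space: log(loss) = log(a) - b * log(C)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"loss ≈ {a:.2f} * C^(-{b:.4f})")

# Extrapolate to a larger budget to judge whether further scaling pays off.
print("predicted loss at 1e20 FLOPs:", a * 1e20 ** (-b))
```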
- Scaling New Frontiers: Insights into Large Recommendation Models [74.77410470984168]
Meta's generative recommendation model HSTU illustrates the scaling laws of recommendation systems by expanding parameters to thousands of billions. We conduct comprehensive ablation studies to explore the origins of these scaling laws. We offer insights into future directions for large recommendation models.
arXiv Detail & Related papers (2024-12-01T07:27:20Z)
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations [11.198481792194452]
Large-scale recommendation systems need to handle tens of billions of user actions on a daily basis.
Despite being trained on huge volumes of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute.
Inspired by the success of Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems.
arXiv Detail & Related papers (2024-02-27T02:37:37Z)
- DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging [34.643717080240584]
We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size.
Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations.
Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models.
arXiv Detail & Related papers (2024-02-04T21:44:09Z)
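A minimal sketch of the depth-weighted averaging idea from the DenseFormer entry above, assuming a stack of standard blocks; the weight initialization and indexing details are illustrative rather than the paper's reference implementation:

```python
# Sketch of DenseFormer-style depth-weighted averaging: after each block,
# replace its output with a learned weighted average of the embeddings and
# all block outputs so far. Initialized to behave like a plain stack.
import torch
import torch.nn as nn

class DepthWeightedStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # weights[i] has i + 2 entries: embeddings plus outputs of blocks 0..i.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 2)) for i in range(len(blocks))])
        for i, w in enumerate(self.weights):
            w.data[i + 1] = 1.0   # identity init: only the newest output counts

    def forward(self, x):
        outputs = [x]                                  # index 0: embeddings
        for i, block in enumerate(self.blocks):
            outputs.append(block(x))                   # block sees the running average
            x = sum(w * h for w, h in zip(self.weights[i], outputs))
        return x
```

Wrapping, say, nn.ModuleList([nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(12)]) gives a drop-in replacement for a plain encoder stack.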
- Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability [73.34532767873785]
We propose the concept of "Information Density" (ID) to indicate whether a matrix strongly belongs to certain feature spaces.
We introduce the Dense Information Prompt (DIP) to enhance information density to improve generalization.
DIP significantly reduces the number of tunable parameters and the requisite storage space, making it particularly advantageous in resource-constrained settings.
arXiv Detail & Related papers (2023-12-17T20:42:43Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, which is the first time a simple transformer-based model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
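A short sketch of the DeepNorm-style residual connection described in the DeepNet entry above; the alpha exponent follows the encoder-only formula reported for DeepNorm, but the wrapper itself is an illustration, and the paper's accompanying rescaling of sub-layer initialization is omitted:

```python
# Sketch of a DeepNorm-style residual connection:
#     x  <-  LayerNorm(alpha * x + sublayer(x)),   alpha = (2N)^(1/4)
# where N is the encoder depth. Sub-layer weight rescaling (beta) is omitted.
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, d_model: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.alpha = (2 * num_layers) ** 0.25   # up-weights the residual stream

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))
```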
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
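As an illustration of what "alteration along three axes" can look like for spectrogram inputs in the TERA entry above (span lengths, counts, and noise scale below are placeholders, not the paper's settings):

```python
# Illustrative three-axis alteration for log-mel inputs: mask time spans,
# mask frequency channels, and perturb magnitudes, then train the encoder
# to reconstruct the clean features. Hyperparameters here are placeholders.
import torch
import torch.nn.functional as F

def alter(spec, time_span=7, n_time_spans=2, channel_span=8, noise_std=0.2):
    # spec: (batch, frames, mel_bins)
    x = spec.clone()
    B, T, C = x.shape
    for b in range(B):
        for _ in range(n_time_spans):                        # time axis
            t0 = torch.randint(0, max(T - time_span, 1), (1,)).item()
            x[b, t0:t0 + time_span, :] = 0.0
        c0 = torch.randint(0, max(C - channel_span, 1), (1,)).item()
        x[b, :, c0:c0 + channel_span] = 0.0                   # channel axis
    return x + noise_std * torch.randn_like(x)                # magnitude axis

def pretrain_loss(encoder, head, spec):
    # Reconstruct the unaltered features from the altered view (L1 objective).
    return F.l1_loss(head(encoder(alter(spec))), spec)
```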
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
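A sketch of what last-layer self-attention distillation in the MiniLM entry above can look like; it assumes teacher and student expose last-layer attention probabilities and value vectors with matching head counts, which is a simplification of the paper's setup:

```python
# Sketch of MiniLM-style distillation: match the student's last-layer
# attention distributions, and the relations between its value vectors,
# to the teacher's via KL divergence. Shapes: attention (B, H, L, L),
# values (B, H, L, d_head).
import torch
import torch.nn.functional as F

def attention_kl(teacher_probs, student_probs, eps=1e-8):
    # KL(teacher || student), averaged over batch entries.
    return F.kl_div((student_probs + eps).log(), teacher_probs,
                    reduction="batchmean")

def value_relation(values):
    # Scaled dot-product between value vectors, turned into a distribution.
    d_head = values.size(-1)
    rel = values @ values.transpose(-1, -2) / d_head ** 0.5
    return rel.softmax(dim=-1)

def minilm_loss(t_attn, t_values, s_attn, s_values):
    return (attention_kl(t_attn, s_attn)
            + attention_kl(value_relation(t_values), value_relation(s_values)))
```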