Efficient Inference of Sub-Item Id-based Sequential Recommendation Models with Millions of Items
- URL: http://arxiv.org/abs/2408.09992v1
- Date: Mon, 19 Aug 2024 13:43:48 GMT
- Title: Efficient Inference of Sub-Item Id-based Sequential Recommendation Models with Millions of Items
- Authors: Aleksandr V. Petrov, Craig Macdonald, Nicola Tonellotto
- Abstract summary: We show that it is possible to improve RecJPQ-based models' inference efficiency using the PQTopK algorithm.
We speed up RecJPQ-enhanced SASRec by a factor of 4.5× compared to the original SASRec's inference method and by a factor of 1.56× compared to the method implemented in RecJPQ code.
- Abstract: Transformer-based recommender systems, such as BERT4Rec or SASRec, achieve state-of-the-art results in sequential recommendation. However, it is challenging to use these models in production environments with catalogues of millions of items: scaling Transformers beyond a few thousand items is problematic for several reasons, including high model memory consumption and slow inference. In this respect, RecJPQ is a state-of-the-art method of reducing the models' memory consumption; RecJPQ compresses item catalogues by decomposing item IDs into a small number of shared sub-item IDs. Despite reporting a reduction of memory consumption by a factor of up to 50×, the original RecJPQ paper did not report inference efficiency improvements over the baseline Transformer-based models. Upon analysing RecJPQ's scoring algorithm, we find that its efficiency is limited by its use of score accumulators for each item, which prevents parallelisation. In contrast, LightRec (a non-sequential method that uses a similar idea of sub-IDs) reported large inference efficiency improvements using an algorithm we call PQTopK. We show that it is also possible to improve RecJPQ-based models' inference efficiency using the PQTopK algorithm. In particular, we speed up RecJPQ-enhanced SASRec by a factor of 4.5× compared to the original SASRec's inference method and by a factor of 1.56× compared to the method implemented in RecJPQ code on the large-scale Gowalla dataset with more than a million items. Further, using simulated data, we show that PQTopK remains efficient with catalogues of up to tens of millions of items, removing one of the last obstacles to using Transformer-based models in production environments with large catalogues.
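For intuition, here is a minimal sketch of PQTopK-style scoring over sub-item IDs, in which partial scores are computed once per sub-item ID and then gathered and summed per item in a single vectorised pass, instead of being accumulated item by item. Shapes, names, and the NumPy formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal PQTopK-style scoring sketch (illustrative, not the paper's code).
# Assumptions: each of N items is encoded as M sub-item IDs (one per codebook),
# each codebook holds K sub-item embeddings of dimension D // M, and the
# Transformer's D-dimensional sequence embedding is split into M chunks.

def pq_topk(seq_emb, codebooks, item_codes, k=10):
    """seq_emb: (D,) sequence embedding; codebooks: (M, K, D // M) sub-item
    embeddings; item_codes: (N, M) sub-item IDs per item."""
    M, K, d = codebooks.shape
    chunks = seq_emb.reshape(M, d)
    # 1) Score every sub-item ID once: an (M, K) table of partial scores.
    sub_scores = np.einsum("md,mkd->mk", chunks, codebooks)
    # 2) Item score = sum of its M partial scores, via one vectorised gather
    #    rather than a per-item accumulator loop.
    scores = sub_scores[np.arange(M), item_codes].sum(axis=1)  # (N,)
    # 3) Partial sort to extract the top-k items.
    top = np.argpartition(-scores, k)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]
```

The gather in step 2 is parallel across all items, which is exactly the property that the per-item accumulator scorer lacks.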
Related papers
- Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs [4.165917157093442]
This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup.
It approximates the CE loss for datasets with large-size catalogs, enhancing both time efficiency and memory usage without compromising recommendation quality.
Experimental results on multiple datasets demonstrate the effectiveness of SCE in reducing peak memory usage by a factor of up to 100 compared to the alternatives.
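The summary does not spell out SCE's construction, so, as a generic illustration of approximating full cross-entropy over a large catalogue, here is a plain sampled-softmax sketch; it is a simpler approximation than SCE, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

# Generic sampled-softmax approximation of full cross-entropy: score the
# positive item plus a few sampled negatives instead of all N catalogue items.
# Illustration of the general idea only, not the paper's SCE loss.

def sampled_ce_loss(seq_emb, item_emb, pos_ids, num_negatives=256):
    """seq_emb: (B, D) sequence embeddings; item_emb: (N, D) item embeddings;
    pos_ids: (B,) ground-truth next-item IDs."""
    B = seq_emb.shape[0]
    N = item_emb.shape[0]
    # Uniform negatives; collisions with the positive are ignored for brevity.
    neg_ids = torch.randint(0, N, (B, num_negatives), device=seq_emb.device)
    cand_ids = torch.cat([pos_ids.unsqueeze(1), neg_ids], dim=1)  # (B, 1 + num_negatives)
    cand_emb = item_emb[cand_ids]                                 # (B, 1 + num_negatives, D)
    logits = torch.einsum("bd,bcd->bc", seq_emb, cand_emb)
    # The positive always sits in column 0 of the candidate set.
    target = torch.zeros(B, dtype=torch.long, device=seq_emb.device)
    return F.cross_entropy(logits, target)
```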
arXiv Detail & Related papers (2024-09-27T13:17:59Z)
- Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement Learning [16.287067991245962]
In real-world systems, an important consideration for a new model is the novelty of its top-k recommendations.
We propose a reinforcement learning (RL) formulation in which large language models provide feedback on the novel items.
We evaluate the proposed algorithm on improving novelty for a query-ad recommendation task on a large-scale search engine.
arXiv Detail & Related papers (2024-06-20T10:20:02Z)
- Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models, which compute similarity by jointly encoding a query-item pair, perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
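As a rough sketch of the general idea (not the paper's actual sparse-matrix factorisation), latent query and item embeddings can be fitted against a sparse set of observed CE scores so that dot products approximate them, after which k-NN search runs against the item embeddings; every name below is an illustrative assumption.

```python
import numpy as np

# Fit latent embeddings so that dot products approximate a sparse matrix of
# observed cross-encoder (CE) scores; illustrative sketch only.

def fit_embeddings(triples, n_queries, n_items, d=64, lr=0.05, epochs=20, seed=0):
    """triples: iterable of (query_id, item_id, ce_score) for the sparse set
    of pairs actually scored by the expensive cross-encoder."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(scale=0.1, size=(n_queries, d))
    V = rng.normal(scale=0.1, size=(n_items, d))
    for _ in range(epochs):
        for q, i, s in triples:
            qv, iv = Q[q].copy(), V[i].copy()
            err = qv @ iv - s            # residual against the observed CE score
            Q[q] -= lr * err * iv        # SGD step on the squared error
            V[i] -= lr * err * qv
    return Q, V

def knn(query_vec, V, k=10):
    scores = V @ query_vec               # approximate CE scores for all items
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]
```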
arXiv Detail & Related papers (2024-05-06T17:14:34Z)
- MMGRec: Multimodal Generative Recommendation with Transformer Model [81.61896141495144]
MMGRec aims to introduce a generative paradigm into multimodal recommendation.
We first devise a hierarchical quantization method, Graph CF-RQVAE, to assign a Rec-ID to each item from its multimodal information.
We then train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences.
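For intuition, here is a minimal sketch of the residual-quantisation idea behind such Rec-IDs, with fixed codebooks for brevity (Graph CF-RQVAE learns them from multimodal information; names and shapes are illustrative assumptions).

```python
import numpy as np

# Residual quantisation: each item embedding becomes a short tuple of codebook
# indices, with each level quantising the residual left by the previous level.
# Codebooks are given here; the paper learns them jointly.

def assign_rec_id(item_vec, codebooks):
    """item_vec: (D,) item embedding; codebooks: list of (K, D) arrays,
    one per quantisation level."""
    residual, code = item_vec.astype(float).copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        code.append(idx)
        residual = residual - cb[idx]   # the next level quantises what is left
    return tuple(code)                  # e.g. (17, 3, 201) serves as the Rec-ID
```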
arXiv Detail & Related papers (2024-04-25T12:11:27Z)
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages with a corpus of 8.8M passages, evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
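A toy sketch of the general idea of linearising a relation matrix into a deterministic, ordered target sequence; DORE's actual symbolic format is defined in the paper, so everything here is an illustrative assumption.

```python
# Flatten a relation matrix into a fixed row-major sequence of triples, so the
# generative model always sees the same deterministic target order.

def linearise(relation_matrix, entities, relations):
    """relation_matrix[i][j] holds a relation id, or None if no relation."""
    triples = []
    for i, head in enumerate(entities):
        for j, tail in enumerate(entities):
            r = relation_matrix[i][j]
            if r is not None:
                triples.append(f"({head}, {relations[r]}, {tail})")
    return " ".join(triples)

# linearise([[None, 0], [None, None]], ["Alice", "Paris"], ["born_in"])
# -> "(Alice, born_in, Paris)"
```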
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- One Model Packs Thousands of Items with Recurrent Conditional Query Learning [8.821298331302563]
We propose a Recurrent Conditional Query Learning (RCQL) method to solve both 2D and 3D packing problems.
RCQL reduces the average bin gap ratio by 1.83% in offline 2D 40-box cases and 7.84% in 3D cases compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-11-12T14:00:30Z)
- A Generic Network Compression Framework for Sequential Recommender Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed CpRec, in which two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve up to 4-8x compression rates on real-world SRS datasets.
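The summary does not name the two shrinking techniques, so here is one generic embedding-compression idea in this spirit: a block-wise table in which less frequent items get lower-dimensional embeddings projected up to the model dimension. All block sizes, dimensions, and names are assumptions, not CpRec's actual design.

```python
import torch
import torch.nn as nn

# Block-wise embedding table: frequent items (low IDs) keep full-width
# embeddings, the long tail gets narrow embeddings projected up to d_model.

class BlockedEmbedding(nn.Module):
    def __init__(self, block_sizes=(10_000, 990_000), block_dims=(128, 16), d_model=128):
        super().__init__()
        self.bounds = torch.tensor(block_sizes).cumsum(0).tolist()
        self.tables = nn.ModuleList(nn.Embedding(n, d) for n, d in zip(block_sizes, block_dims))
        self.projs = nn.ModuleList(nn.Linear(d, d_model, bias=False) for d in block_dims)
        self.d_model = d_model

    def forward(self, item_ids):
        """item_ids: LongTensor of any shape, assumed sorted by frequency rank."""
        out = torch.zeros(*item_ids.shape, self.d_model)
        start = 0
        for bound, table, proj in zip(self.bounds, self.tables, self.projs):
            mask = (item_ids >= start) & (item_ids < bound)
            if mask.any():
                # Embed block-local IDs, then project into the model dimension.
                out[mask] = proj(table(item_ids[mask] - start))
            start = bound
        return out
```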
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
- QCBA: Improving Rule Classifiers Learned from Quantitative Data by Recovering Information Lost by Discretisation [5.667821885065119]
This paper describes new rule-tuning steps that aim to recover information lost in discretisation, along with new pruning techniques.
The proposed QCBA method was initially developed to postprocess quantitative attributes in models generated by the Classification Based on Associations (CBA) algorithm.
Benchmarks on 22 datasets from the UCI repository show smaller size and the overall best predictive performance for FOIL2+QCBA compared to all seven baselines.
arXiv Detail & Related papers (2017-11-28T08:09:14Z)