STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM
- URL: http://arxiv.org/abs/2409.07276v2
- Date: Fri, 13 Sep 2024 04:16:55 GMT
- Title: STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM
- Authors: Qijiong Liu, Jieming Zhu, Lu Fan, Zhou Zhao, Xiao-Ming Wu
- Abstract summary: We propose a unified framework to streamline the semantic tokenization and generative recommendation process.
We formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task.
All these tasks are framed in a generative manner and trained using a single large language model (LLM) backbone.
- Score: 59.08493154172207
- License:
- Abstract: Traditional recommendation models often rely on unique item identifiers (IDs) to distinguish between items, which can hinder their ability to effectively leverage item content information and generalize to long-tail or cold-start items. Recently, semantic tokenization has been proposed as a promising solution that aims to tokenize each item's semantic representation into a sequence of discrete tokens. In this way, it preserves the item's semantics within these tokens and ensures that semantically similar items are represented by similar tokens. These semantic tokens have become fundamental in training generative recommendation models. However, existing generative recommendation methods typically involve multiple sub-models for embedding, quantization, and recommendation, leading to an overly complex system. In this paper, we propose to streamline the semantic tokenization and generative recommendation process with a unified framework, dubbed STORE, which leverages a single large language model (LLM) for both tasks. Specifically, we formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task. All these tasks are framed in a generative manner and trained using a single LLM backbone. Extensive experiments have been conducted to validate the effectiveness of our STORE framework across various recommendation tasks and datasets. We will release the source code and configurations for reproducible research.
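The abstract casts all of STORE's tasks as generation with one LLM backbone. As a rough illustration only (not the authors' implementation), such tasks could be serialized into source/target training pairs for a single sequence-to-sequence model; the task prefixes and the placeholder semantic tokens below are assumptions:

```python
# Illustrative sketch of framing tokenization, reconstruction, and
# recommendation as source/target pairs for one LLM; the prefixes and the
# <a_12>-style semantic tokens are hypothetical, not STORE's actual format.
from dataclasses import dataclass

@dataclass
class Example:
    source: str   # input sequence fed to the LLM
    target: str   # sequence the LLM is trained to generate

def text_to_token(item_text, semantic_tokens):
    # Semantic tokenization: item content -> discrete semantic tokens.
    return Example("tokenize: " + item_text, " ".join(semantic_tokens))

def token_to_text(semantic_tokens, item_text):
    # Reconstruction (auxiliary): semantic tokens -> item content.
    return Example("reconstruct: " + " ".join(semantic_tokens), item_text)

def token_to_token(history_tokens, next_item_tokens):
    # Generative recommendation: user-history tokens -> next item's tokens.
    return Example("recommend: " + " ".join(history_tokens),
                   " ".join(next_item_tokens))

ex = text_to_token("Wireless noise-cancelling headphones", ["<a_12>", "<b_7>", "<c_3>"])
print(ex.source, "->", ex.target)
```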
Related papers
- Order-agnostic Identifier for Large Language Model-based Generative Recommendation [94.37662915542603]
Items are assigned identifiers for Large Language Models (LLMs) to encode user history and generate the next item.
Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings.
We propose SETRec, which leverages semantic tokenizers to obtain order-agnostic multi-dimensional tokens.
arXiv Detail & Related papers (2025-02-15T15:25:38Z)
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs).
Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens.
We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z)
- TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation [16.93374578679005]
TokenRec is a novel tokenization and retrieval framework for large language model (LLM)-based recommender systems (RecSys).
Our strategy, Masked Vector-Quantized (MQ) Tokenizer, quantizes the masked user/item representations learned from collaborative filtering into discrete tokens (a generic sketch of this quantization idea appears after this related-papers list).
Our generative retrieval paradigm is designed to efficiently recommend top-$K$ items for users while eliminating the need for auto-regressive decoding and beam search.
arXiv Detail & Related papers (2024-06-15T00:07:44Z)
- Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens [51.584024345378005]
We show how to effectively tokenize users and items in large language model (LLM)-based recommender systems.
We emphasize the role of out-of-vocabulary (OOV) tokens in addition to the in-vocabulary ones.
Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.
arXiv Detail & Related papers (2024-06-12T17:59:05Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- LabelPrompt: Effective Prompt-based Learning for Relation Classification [31.291466190218912]
This paper presents a novel prompt-based learning method, namely LabelPrompt, for the relation classification task.
Motivated by the intuition to "GIVE MODEL CHOICES!", we first define additional tokens to represent relation labels, regarding these tokens as the verbaliser with semantic initialisation.
Then, to mitigate inconsistency between predicted relations and given entities, we implement an entity-aware module with contrastive learning.
arXiv Detail & Related papers (2023-02-16T04:06:25Z)
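Several entries above (TokenRec's MQ tokenizer, LMIndexer's semantic IDs) build on the same idea that also underlies STORE's semantic tokens: quantizing a continuous item representation into a short sequence of discrete codes. The sketch below is a generic residual-quantization illustration with assumed codebook sizes, not a reproduction of any of these papers' tokenizers:

```python
# Generic residual-quantization sketch: map a continuous embedding to one
# discrete code per level, quantizing the leftover residual at each level.
# Codebook sizes and dimensions are made up for illustration.
import numpy as np

def quantize(embedding, codebooks):
    residual = embedding.copy()
    tokens = []
    for codebook in codebooks:                      # codebook shape: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest code at this level
        tokens.append(idx)
        residual = residual - codebook[idx]         # pass the remaining error down
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]  # 3 levels of 256 codes each
item_embedding = rng.normal(size=64)
print(quantize(item_embedding, codebooks))          # a 3-token semantic ID, e.g. [17, 203, 88]
```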