Purely Semantic Indexing for LLM-based Generative Recommendation and Retrieval
- URL: http://arxiv.org/abs/2509.16446v1
- Date: Fri, 19 Sep 2025 21:59:55 GMT
- Title: Purely Semantic Indexing for LLM-based Generative Recommendation and Retrieval
- Authors: Ruohan Zhang, Jiacheng Li, Julian McAuley, Yupeng Hou
- Abstract summary: We propose purely semantic indexing to generate unique, semantic-preserving IDs without appending non-semantic tokens. We enable unique ID assignment by relaxing the strict nearest-centroid selection and introduce two model-agnostic algorithms.
- Score: 28.366331215978445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic identifiers (IDs) have proven effective in adapting large language models for generative recommendation and retrieval. However, existing methods often suffer from semantic ID conflicts, where semantically similar documents (or items) are assigned identical IDs. A common strategy to avoid conflicts is to append a non-semantic token to distinguish them, which introduces randomness and expands the search space, therefore hurting performance. In this paper, we propose purely semantic indexing to generate unique, semantic-preserving IDs without appending non-semantic tokens. We enable unique ID assignment by relaxing the strict nearest-centroid selection and introduce two model-agnostic algorithms: exhaustive candidate matching (ECM) and recursive residual searching (RRS). Extensive experiments on sequential recommendation, product search, and document retrieval tasks demonstrate that our methods improve both overall and cold-start performance, highlighting the effectiveness of ensuring ID uniqueness.
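The abstract's core idea, guaranteeing ID uniqueness by relaxing strict nearest-centroid selection during quantization, can be illustrated with a small sketch. The code below is a simplified illustration under assumed inputs (random embeddings and random codebooks), not the paper's actual ECM or RRS algorithms: earlier levels still pick the strict nearest centroid, and only the last level falls back to the next-nearest centroid whenever the nearest one would produce a duplicate ID.

```python
import numpy as np

def assign_unique_semantic_ids(embeddings, codebooks):
    """Residual-quantization semantic IDs with a uniqueness guarantee.

    Levels before the last pick the strict nearest centroid; at the last
    level we relax that choice, walking centroids in order of distance
    until an unused full ID is found. This is a simplified stand-in for
    the paper's ECM/RRS algorithms, which search more systematically.
    """
    used, ids = set(), []
    for emb in embeddings:
        residual = np.asarray(emb, dtype=float).copy()
        prefix = []
        for book in codebooks[:-1]:
            c = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
            prefix.append(c)
            residual = residual - book[c]
        last = codebooks[-1]
        dists = np.linalg.norm(last - residual, axis=1)
        for c in np.argsort(dists):          # nearest first, then relax
            candidate = tuple(prefix + [int(c)])
            if candidate not in used:
                used.add(candidate)
                ids.append(candidate)
                break
        else:
            raise RuntimeError("last-level codebook exhausted for this prefix")
    return ids
```

With three codebooks of 16 centroids each, every item ends up with a distinct three-token ID even when several semantically similar items share the same nearest centroids at every level.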
Related papers
- Unleash the Potential of Long Semantic IDs for Generative Recommendation [5.6264583086973685]
ACERec is a novel framework that bridges the gap between fine-grained tokenization and efficient sequential modeling. It consistently outperforms state-of-the-art baselines on six real-world benchmarks.
arXiv Detail & Related papers (2026-02-14T03:15:31Z)
- Differentiable Semantic ID for Generative Recommendation [65.83703273297492]
Generative recommendation provides a novel paradigm in which each item is represented by a discrete semantic ID (SID) learned from rich content. In practice, SIDs are typically optimized only for content reconstruction rather than recommendation accuracy. A natural approach is to make semantic indexing differentiable so that recommendation gradients can directly influence SID learning. We propose DIGER, a first step toward effective differentiable semantic IDs for generative recommendation.
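Making discrete semantic indexing differentiable usually hinges on getting gradients through a non-differentiable nearest-centroid assignment. One standard trick for this is the straight-through estimator, sketched below in plain NumPy; whether DIGER uses exactly this mechanism is not stated in the summary, so treat this as a generic illustration.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-centroid assignment (the non-differentiable argmin step)."""
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

def straight_through(z, codebook):
    """Straight-through estimator for differentiable discrete indexing.

    The forward value equals the quantized centroid; in an autograd
    framework the (q - z) term would be detached, so the backward pass
    sees the identity map and task-loss gradients reach z directly.
    """
    idx, q = quantize(z, codebook)
    detached = q - z            # would be wrapped in stop-gradient
    return idx, z + detached    # value == q, d/dz treated as identity
```

In a real training loop the `detached` term is produced with the framework's stop-gradient operation (e.g. `tensor.detach()` in PyTorch), which is what lets recommendation gradients flow into the encoder output `z`.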
arXiv Detail & Related papers (2026-01-27T15:34:11Z) - Semantic IDs for Joint Generative Search and Recommendation [39.49814138519702]
Generative models are emerging as a unified solution for powering both recommendation and search tasks.<n>We show how to construct Semantic IDs that perform well both in search and recommendation when using a unified model.
arXiv Detail & Related papers (2025-08-14T09:28:49Z) - SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs [70.79124435220695]
We propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE)<n>We first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation.<n>We then introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination.
arXiv Detail & Related papers (2025-04-17T17:59:27Z) - Order-agnostic Identifier for Large Language Model-based Generative Recommendation [94.37662915542603]
Items are assigned identifiers for Large Language Models (LLMs) to encode user history and generate the next item.<n>Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings.<n>We propose SETRec, which leverages semantic tokenizers to obtain order-agnostic multi-dimensional tokens.
arXiv Detail & Related papers (2025-02-15T15:25:38Z) - Summarization-Based Document IDs for Generative Retrieval with Language Models [65.11811787587403]
We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases.
We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively.
We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO.
arXiv Detail & Related papers (2023-11-14T23:28:36Z) - Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z) - Recommender Systems with Generative Retrieval [58.454606442670034]
We propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates.
To that end, we create semantically meaningful tuples of codewords to serve as a Semantic ID for each item.
We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets.
arXiv Detail & Related papers (2023-05-08T21:48:17Z)
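Autoregressively decoding codeword identifiers, as in the generative retrieval approach above, typically requires constraining generation to IDs that name real items. A common way to do this (a sketch under assumed toy inputs, not the paper's implementation) is to mask each decoding step's logits with a prefix trie built over the valid semantic IDs:

```python
import numpy as np

def build_trie(semantic_ids):
    """Prefix trie over item semantic IDs (tuples of codewords)."""
    trie = {}
    for sid in semantic_ids:
        node = trie
        for code in sid:
            node = node.setdefault(code, {})
    return trie

def constrained_decode(logits_per_step, trie):
    """Greedy decoding masked to prefixes that exist in the trie,
    so the generated sequence always follows a valid item ID."""
    node, out = trie, []
    for logits in logits_per_step:
        mask = np.full_like(logits, -np.inf)
        mask[list(node.keys())] = 0.0        # only valid continuations
        code = int(np.argmax(logits + mask))
        out.append(code)
        node = node[code]
    return tuple(out)
```

Here the per-step logits stand in for a trained model's outputs; in practice the same masking is applied inside beam search so every beam remains a prefix of some item's Semantic ID.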
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.