FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
- URL: http://arxiv.org/abs/2509.20904v2
- Date: Fri, 26 Sep 2025 00:53:31 GMT
- Title: FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
- Authors: Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng, Yu Li, Zhihong Chen, Jian Wu, Xiangheng Kong, Shengyu Zhang, Kun Kuang, Yuning Jiang, Bo Zheng,
- Abstract summary: FORGE is a benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets.<n>For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half.
- Score: 64.51403245281547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on the "Guess You Like" section of Taobao's homepage shows a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.
Related papers
- R2LED: Equipping Retrieval and Refinement in Lifelong User Modeling with Semantic IDs for CTR Prediction [23.668401664583758]
We propose a novel paradigm that equips retrieval and refinement in Lifelong User Modeling with SEmantic IDs (R2LED)<n>First, we introduce a Multi-route Mixed Retrieval for the retrieval stage. On the other hand, a mixed retrieval mechanism is proposed to efficiently retrieve candidates from both collaborative and semantic views.<n>For refinement, we design a Bi-level Fusion Refinement, including a target-aware cross-attention for route-level fusion and a gate mechanism for SID-level fusion.
arXiv Detail & Related papers (2026-02-06T11:27:20Z) - Masked Diffusion Generative Recommendation [14.679550929790151]
Generative recommendation (GR) typically first quantizes continuous item embeddings into multi-level semantic IDs (SIDs)<n>We propose MDGR, a Masked Diffusion Generative Recommendation framework that reshapes the GR pipeline from three perspectives: codebook, training, and inference.
arXiv Detail & Related papers (2026-01-27T11:39:02Z) - OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z) - Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data [6.819856310521865]
We introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data.<n> Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to the existing works.
arXiv Detail & Related papers (2025-08-28T08:51:14Z) - TaoSR1: The Thinking Model for E-commerce Relevance Search [15.137901457184839]
BERT-based models excel at semantic matching but lack complex reasoning capabilities.<n>We propose a framework to directly deploy Large Language Models for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility.<n>Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO)
arXiv Detail & Related papers (2025-08-17T13:48:48Z) - eSapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation [10.667949307405983]
eSapiens is an AI-as-a-Service (AI) platform engineered around a business-oriented trifecta: proprietary data, operational, and any major Large Language Model (LLM)<n>eSapiens gives businesses full control over their AI assets, keeping everything in-house for AI knowledge retention and data security.
arXiv Detail & Related papers (2025-07-13T11:41:44Z) - Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.<n>The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard.<n>We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z) - Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [63.484378941471114]
We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection.<n>Experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data.
arXiv Detail & Related papers (2024-12-09T08:36:10Z) - CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances, which are semantically related to the query through the interaction of different modal data.<n>Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.<n>We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Generalizable Metric Network for Cross-domain Person Re-identification [55.71632958027289]
Cross-domain (i.e., domain generalization) scene presents a challenge in Re-ID tasks.
Most existing methods aim to learn domain-invariant or robust features for all domains.
We propose a Generalizable Metric Network (GMN) to explore sample similarity in the sample-pair space.
arXiv Detail & Related papers (2023-06-21T03:05:25Z) - GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for
Remote Sensing Data [27.63411386396492]
This paper introduces a new benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data.
The proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data.
arXiv Detail & Related papers (2023-05-24T09:03:18Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - Optimizing Performance of Federated Person Re-identification:
Benchmarking and Analysis [14.545746907150436]
FedReID implements federated learning, an emerging distributed training method, to person ReID.
FedReID preserves data privacy by aggregating model updates, instead of raw data, from clients to a central server.
arXiv Detail & Related papers (2022-05-24T15:20:32Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.