GSID: Generative Semantic Indexing for E-Commerce Product Understanding
- URL: http://arxiv.org/abs/2509.23860v1
- Date: Sun, 28 Sep 2025 12:58:05 GMT
- Title: GSID: Generative Semantic Indexing for E-Commerce Product Understanding
- Authors: Haiyang Yang, Qinye Xie, Qingheng Zhang, Liyu Chen, Huike Zou, Chengbao Lian, Shuguang Han, Fei Huang, Jufeng Chen, Bo Zheng
- Abstract summary: We propose Generative Semantic Indexings (GSID) to generate structured product representations. GSID consists of two key components: (1) pre-training on unstructured product metadata to learn in-domain semantic embeddings, and (2) generating more effective semantic codes tailored for downstream applications. It has been successfully deployed on a real-world e-commerce platform, achieving promising results on product understanding and other downstream tasks.
- Score: 32.89899469298562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured representation of product information is a major bottleneck for the efficiency of e-commerce platforms, especially second-hand e-commerce platforms. Currently, most product information is organized according to manually curated product categories and attributes, which often fail to adequately cover long-tail products and do not align well with buyer preferences. To address these problems, we propose Generative Semantic Indexings (GSID), a data-driven approach to generating structured product representations. GSID consists of two key components: (1) pre-training on unstructured product metadata to learn in-domain semantic embeddings, and (2) generating more effective semantic codes tailored for downstream product-centric applications. Extensive experiments validate the effectiveness of GSID, and it has been successfully deployed on a real-world e-commerce platform, achieving promising results on product understanding and other downstream tasks.
Related papers
- FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets [64.51403245281547]
FORGE is a benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets.
For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half.
arXiv Detail & Related papers (2025-09-25T08:44:22Z)
- EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association [83.4879773429742]
This paper defines the task of E-commerce Script Planning (EcomScript) as three sequential subtasks.
We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step.
We construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products.
arXiv Detail & Related papers (2025-05-21T07:21:38Z)
- eC-Tab2Text: Aspect-Based Text Generation from e-Commerce Product Tables [6.384763560610077]
We introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce.
We focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews.
Our results demonstrate substantial improvements in generating contextually accurate reviews.
arXiv Detail & Related papers (2025-02-20T18:41:48Z)
- Hi-Gen: Generative Retrieval For Large-Scale Personalized E-commerce Search [9.381220988816219]
We introduce an efficient Hierarchical encoding-decoding Generative retrieval method (Hi-Gen) for large-scale personalized E-commerce search systems.
We first design a representation learning model using metric learning to learn discriminative feature representations of items.
Then, we propose a category-guided hierarchical clustering scheme that makes full use of the semantic and efficiency information of items.
arXiv Detail & Related papers (2024-04-24T06:05:35Z)
- Enhanced E-Commerce Attribute Extraction: Innovating with Decorative Relation Correction and LLAMA 2.0-Based Annotation [4.81846973621209]
We propose a pioneering framework that integrates BERT for classification, a Conditional Random Fields (CRFs) layer for attribute value extraction, and Large Language Models (LLMs) for data annotation.
Our approach capitalizes on the robust representation learning of BERT, synergized with the sequence decoding prowess of CRFs, to adeptly identify and extract attribute values.
Our methodology is rigorously validated on various datasets, including Walmart, BestBuy's e-commerce NER dataset, and the CoNLL dataset.
arXiv Detail & Related papers (2023-12-09T08:26:30Z)
- EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce [68.72104414369635]
We propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data.
EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks.
arXiv Detail & Related papers (2023-08-14T06:49:53Z)
- Automatic Controllable Product Copywriting for E-Commerce [58.97059802658354]
We deploy E-commerce Prefix-based Controllable Copywriting Generation (EPCCG) on the JD.com e-commerce recommendation platform.
We conduct experiments to validate the effectiveness of the proposed EPCCG.
We introduce the deployed architecture, which integrates EPCCG into the real-time JD.com e-commerce recommendation platform.
arXiv Detail & Related papers (2022-06-21T04:18:52Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.