Text-Based Product Matching -- Semi-Supervised Clustering Approach
- URL: http://arxiv.org/abs/2402.10091v1
- Date: Thu, 1 Feb 2024 18:52:26 GMT
- Title: Text-Based Product Matching -- Semi-Supervised Clustering Approach
- Authors: Alicja Martinek, Szymon {\L}ukasik, Amir H. Gandomi
- Abstract summary: This paper aims to present a new philosophy to product matching utilizing a semi-supervised clustering approach.
We study the properties of this method by experimenting with the IDEC algorithm on the real-world dataset.
- Score: 9.748519919202986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matching identical products present in multiple product feeds constitutes a
crucial element of many tasks of e-commerce, such as comparing product
offerings, dynamic price optimization, and selecting the assortment
personalized for the client. It corresponds to the well-known machine learning
task of entity matching, with its own specificity, like omnipresent
unstructured data or inaccurate and inconsistent product descriptions. This
paper aims to present a new philosophy to product matching utilizing a
semi-supervised clustering approach. We study the properties of this method by
experimenting with the IDEC algorithm on the real-world dataset using
predominantly textual features and fuzzy string matching, with more standard
approaches as a point of reference. Encouraging results show that unsupervised
matching, enriched with a small annotated sample of product links, could be a
possible alternative to the dominant supervised strategy, requiring extensive
manual data labeling.
Related papers
- Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models [50.370043676415875]
In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods.
We introduce the MIMEX dataset, comprising 28 distinct product categories.
We benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset.
arXiv Detail & Related papers (2024-09-23T12:28:40Z) - STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM [59.08493154172207]
We propose a unified framework to streamline the semantic tokenization and generative recommendation process.
We formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task.
All these tasks are framed in a generative manner and trained using a single large language model (LLM) backbone.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - MMGRec: Multimodal Generative Recommendation with Transformer Model [81.61896141495144]
MMGRec aims to introduce a generative paradigm into multimodal recommendation.
We first devise a hierarchical quantization method Graph CF-RQVAE to assign Rec-ID for each item from its multimodal information.
We then train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences.
arXiv Detail & Related papers (2024-04-25T12:11:27Z) - Enhanced E-Commerce Attribute Extraction: Innovating with Decorative
Relation Correction and LLAMA 2.0-Based Annotation [4.81846973621209]
We propose a pioneering framework that integrates BERT for classification, a Conditional Random Fields (CRFs) layer for attribute value extraction, and Large Language Models (LLMs) for data annotation.
Our approach capitalizes on the robust representation learning of BERT, synergized with the sequence decoding prowess of CRFs, to adeptly identify and extract attribute values.
Our methodology is rigorously validated on various datasets, including Walmart, BestBuy's e-commerce NER dataset, and the CoNLL dataset.
arXiv Detail & Related papers (2023-12-09T08:26:30Z) - A Unified Generative Approach to Product Attribute-Value Identification [6.752749933406399]
We explore a generative approach to the product attribute-value identification (PAVI) task.
We finetune a pre-trained generative model, T5, to decode a set of attribute-value pairs as a target sequence from the given product text.
Experimental results confirm that our generation-based approach outperforms the existing extraction and classification-based methods.
arXiv Detail & Related papers (2023-06-09T00:33:30Z) - Exploiting Diversity of Unlabeled Data for Label-Efficient
Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We exploit to train a more effective cross-modal model which is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - Interpretable Methods for Identifying Product Variants [0.2589904091148018]
We introduce a novel approach to identifying product variants.
It combines both constrained clustering and tailored NLP techniques.
We design the algorithm to meet certain business criteria, including meeting high accuracy requirements.
arXiv Detail & Related papers (2021-04-12T14:37:16Z) - Automatic Validation of Textual Attribute Values in E-commerce Catalog
by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z) - A Hybrid Approach to Enhance Pure Collaborative Filtering based on
Content Feature Relationship [0.17188280334580192]
We introduce a novel method to extract the implicit relationship between content features using a sort of well-known methods from the natural language processing domain, namely Word2Vec.
Next, we propose a novel content-based recommendation system that employs the relationship to determine vector representations for items.
Our evaluation results demonstrate that it can predict the preference a user would have for a set of items as good as pure collaborative filtering.
arXiv Detail & Related papers (2020-05-17T02:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.