Text-Based Product Matching -- Semi-Supervised Clustering Approach
- URL: http://arxiv.org/abs/2402.10091v1
- Date: Thu, 1 Feb 2024 18:52:26 GMT
- Title: Text-Based Product Matching -- Semi-Supervised Clustering Approach
- Authors: Alicja Martinek, Szymon Łukasik, Amir H. Gandomi
- Abstract summary: This paper aims to present a new philosophy to product matching utilizing a semi-supervised clustering approach.
We study the properties of this method by experimenting with the IDEC algorithm on a real-world dataset.
- Score: 9.748519919202986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matching identical products present in multiple product feeds constitutes a
crucial element of many tasks of e-commerce, such as comparing product
offerings, dynamic price optimization, and selecting the assortment
personalized for the client. It corresponds to the well-known machine learning
task of entity matching, with its own specificity, like omnipresent
unstructured data or inaccurate and inconsistent product descriptions. This
paper aims to present a new philosophy to product matching utilizing a
semi-supervised clustering approach. We study the properties of this method by
experimenting with the IDEC algorithm on a real-world dataset using
predominantly textual features and fuzzy string matching, with more standard
approaches as a point of reference. Encouraging results show that unsupervised
matching, enriched with a small annotated sample of product links, could be a
possible alternative to the dominant supervised strategy, requiring extensive
manual data labeling.
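The idea of pairing fuzzy string matching with a small annotated sample of product links can be illustrated with a minimal Python sketch. It is not the paper's IDEC pipeline: it swaps in a plain similarity threshold plus union-find grouping, and the names `title_similarity`, `group_offers`, `known_links`, and the threshold of 0.8 are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Fuzzy similarity in [0, 1] between two product titles (stdlib difflib)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_offers(titles, known_links=(), threshold=0.8):
    """Group offers that likely describe the same product.

    known_links is a small set of annotated (i, j) index pairs known to match
    (an assumption standing in for the annotated sample of product links).
    threshold is an illustrative cut-off, not a value taken from the paper.
    Returns one group label per title; equal labels mean the same group.
    """
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Supervised part: annotated links are merged unconditionally.
    for i, j in known_links:
        union(i, j)

    # Unsupervised part: merge pairs whose fuzzy similarity reaches the threshold.
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if title_similarity(titles[i], titles[j]) >= threshold:
                union(i, j)

    return [find(i) for i in range(len(titles))]
```

Calling `group_offers(titles, known_links=[(2, 3)])` merges offers 2 and 3 even when their titles are worded too differently for the threshold to catch, which is the role the annotated links play in this sketch.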
Related papers
- Enhanced E-Commerce Attribute Extraction: Innovating with Decorative Relation Correction and LLAMA 2.0-Based Annotation [4.81846973621209]
We propose a pioneering framework that integrates BERT for classification, a Conditional Random Fields (CRFs) layer for attribute value extraction, and Large Language Models (LLMs) for data annotation.
Our approach capitalizes on the robust representation learning of BERT, synergized with the sequence decoding prowess of CRFs, to adeptly identify and extract attribute values.
Our methodology is rigorously validated on various datasets, including Walmart, BestBuy's e-commerce NER dataset, and the CoNLL dataset.
arXiv Detail & Related papers (2023-12-09T08:26:30Z)
- JPAVE: A Generation and Classification-based Model for Joint Product Attribute Prediction and Value Extraction [59.94977231327573]
We propose a multi-task learning model with value generation/classification and attribute prediction called JPAVE.
Two variants of our model are designed for open-world and closed-world scenarios.
Experimental results on a public dataset demonstrate the superiority of our model compared with strong baselines.
arXiv Detail & Related papers (2023-11-07T18:36:16Z)
- Product Attribute Value Extraction using Large Language Models [56.96665345570965]
State-of-the-art attribute/value extraction methods based on pre-trained language models (PLMs) face two drawbacks.
We explore the potential of using large language models (LLMs) as a more training data-efficient and more robust alternative to existing attribute/value extraction methods.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- A Unified Generative Approach to Product Attribute-Value Identification [6.752749933406399]
We explore a generative approach to the product attribute-value identification (PAVI) task.
We finetune a pre-trained generative model, T5, to decode a set of attribute-value pairs as a target sequence from the given product text.
Experimental results confirm that our generation-based approach outperforms the existing extraction and classification-based methods.
arXiv Detail & Related papers (2023-06-09T00:33:30Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets and define two practical instance-level retrieval tasks.
We train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Interpretable Methods for Identifying Product Variants [0.2589904091148018]
We introduce a novel approach to identifying product variants.
It combines both constrained clustering and tailored NLP techniques.
We design the algorithm to meet certain business criteria, including high accuracy requirements.
arXiv Detail & Related papers (2021-04-12T14:37:16Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data [61.789797281676606]
We propose a novel meta-learning latent variable approach, called MetaBridge.
It can learn transferable knowledge from a subset of categories with limited labeled data.
It can capture the uncertainty of never-seen categories with unlabeled data.
arXiv Detail & Related papers (2020-06-15T21:31:05Z)
- A Hybrid Approach to Enhance Pure Collaborative Filtering based on Content Feature Relationship [0.17188280334580192]
We introduce a novel method to extract the implicit relationship between content features using a well-known method from the natural language processing domain, namely Word2Vec.
Next, we propose a novel content-based recommendation system that employs the relationship to determine vector representations for items.
Our evaluation results demonstrate that it can predict the preference a user would have for a set of items as well as pure collaborative filtering.
arXiv Detail & Related papers (2020-05-17T02:20:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.