Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
- URL: http://arxiv.org/abs/2106.09493v1
- Date: Sat, 12 Jun 2021 08:45:56 GMT
- Title: Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
- Authors: Ravi Shankar Mishra, Kartik Mehta, Nikhil Rasiwasia
- Abstract summary: We present SANTA, a framework to automatically normalize E-commerce attribute values.
We first perform an extensive study of nine syntactic matching algorithms.
We argue that string similarity alone is not sufficient for attribute normalization.
- Score: 0.25782420501870296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present SANTA, a scalable framework to automatically
normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of
pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute
normalization focused on fuzzy string matching (also referred to as syntactic
matching in this paper). In this work, we first perform an extensive study of
nine syntactic matching algorithms and establish that 'cosine' similarity leads
to the best results, showing a 2.7% improvement over the commonly used Jaccard index.
Next, we argue that string similarity alone is not sufficient for attribute
normalization as many surface forms require going beyond syntactic matching
(e.g. "720p" and "HD" are synonyms). While semantic techniques like
unsupervised embeddings (e.g. word2vec/fastText) have shown good results in
word similarity tasks, we observed that they perform poorly at distinguishing
between close canonical forms, as these close forms often occur in similar
contexts. We propose to learn token embeddings using a twin network with
triplet loss. We propose an embedding learning task leveraging raw attribute
values and product titles to learn these embeddings in a self-supervised
fashion. We show that providing supervision using our proposed task improves
over both syntactic and unsupervised embedding-based techniques for attribute
normalization. Experiments on a real-world attribute normalization dataset of
50 attributes show that the embeddings trained using our proposed approach
obtain a 2.3% improvement over the best string matching and a 19.3% improvement over
the best unsupervised embeddings.
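As a rough illustration of the syntactic-matching baseline discussed in the abstract, the sketch below maps a raw attribute value to its closest canonical value using cosine similarity over token count vectors, with the Jaccard index included for comparison. The whitespace tokenization and the tiny canonical list are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of syntactic attribute normalization: pick the canonical
# value most similar to the raw surface form. Tokenization and the canonical
# list here are assumptions for illustration only.
from collections import Counter
import math


def tokens(s):
    return s.lower().split()


def cosine(a, b):
    # Cosine similarity over token count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    # Jaccard index over token sets.
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def normalize(raw_value, canonical_values, sim=cosine):
    # Return the canonical value with the highest similarity to the raw value.
    return max(canonical_values, key=lambda c: sim(raw_value, c))


canon = ["Windows 10", "Windows 8", "Mac OS"]
print(normalize("Win 10 Pro", canon, sim=cosine))   # -> "Windows 10"
print(normalize("Win 10 Pro", canon, sim=jaccard))  # -> "Windows 10"
```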
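The twin network with triplet loss can be sketched roughly as follows. This is a minimal sketch assuming a PyTorch implementation with a plain token-embedding table; the triplet construction (anchor and positive tokens drawn from the same product's title and attribute value, a negative token from a different product) is a simplification of the self-supervised task the abstract describes, not the paper's exact procedure.

```python
# Hedged sketch: learn token embeddings with a shared-weight ("twin") encoder
# trained by triplet loss. The triplet sampling shown in the dummy batch is a
# stand-in for the paper's self-supervised task over titles and attribute values.
import torch
import torch.nn as nn


class TokenEncoder(nn.Module):
    """Embeds a token id; the same weights encode anchor, positive and negative."""

    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        # L2-normalize so distances are comparable across tokens.
        return nn.functional.normalize(self.emb(token_ids), dim=-1)


def train_step(encoder, optimizer, anchor_ids, positive_ids, negative_ids, margin=0.2):
    """One optimization step on a batch of (anchor, positive, negative) token ids."""
    loss_fn = nn.TripletMarginLoss(margin=margin)
    loss = loss_fn(encoder(anchor_ids), encoder(positive_ids), encoder(negative_ids))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


vocab_size = 10000  # hypothetical vocabulary size
encoder = TokenEncoder(vocab_size)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
# Dummy batch of token ids standing in for real (anchor, positive, negative) triplets.
a = torch.randint(0, vocab_size, (32,))
p = torch.randint(0, vocab_size, (32,))
n = torch.randint(0, vocab_size, (32,))
print(train_step(encoder, optimizer, a, p, n))
```

At inference time, a raw value could then be mapped to the canonical value whose token embeddings lie closest, which is what lets synonyms such as "720p" and "HD" match despite having no string overlap.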
Related papers
- SimMatchV2: Semi-Supervised Learning with Graph Consistency [53.31681712576555]
We introduce a new semi-supervised learning algorithm - SimMatchV2.
It formulates various consistency regularizations between labeled and unlabeled data from the graph perspective.
SimMatchV2 has been validated on multiple semi-supervised learning benchmarks.
arXiv Detail & Related papers (2023-08-13T05:56:36Z)
- VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain [0.0]
The paper focuses on deep learning semantic search algorithms applied in the HR domain.
The aim of the article is to develop a novel approach to training a Siamese network to link the skills mentioned in the job ad with the title.
arXiv Detail & Related papers (2023-07-31T13:21:15Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on cross-device and cross-environment ASR, respectively.
arXiv Detail & Related papers (2021-04-15T14:36:54Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- CoMatch: Semi-supervised Learning with Contrastive Graph Regularization [86.84486065798735]
CoMatch is a new semi-supervised learning method that unifies dominant approaches.
It achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2020-11-23T02:54:57Z)
- UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection [5.099262949886174]
This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time.
We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings.
arXiv Detail & Related papers (2020-04-30T18:43:57Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.