Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
- URL: http://arxiv.org/abs/2106.09493v1
- Date: Sat, 12 Jun 2021 08:45:56 GMT
- Title: Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
- Authors: Ravi Shankar Mishra, Kartik Mehta, Nikhil Rasiwasia
- Abstract summary: We present SANTA, a framework to automatically normalize E-commerce attribute values.
We first perform an extensive study of nine syntactic matching algorithms.
We argue that string similarity alone is not sufficient for attribute normalization.
- Score: 0.25782420501870296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present SANTA, a scalable framework to automatically
normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of
pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute
normalization focused on fuzzy string matching (also referred to as syntactic
matching in this paper). In this work, we first perform an extensive study of
nine syntactic matching algorithms and establish that 'cosine' similarity leads
to the best results, showing a 2.7% improvement over the commonly used Jaccard index.
Next, we argue that string similarity alone is not sufficient for attribute
normalization as many surface forms require going beyond syntactic matching
(e.g. "720p" and "HD" are synonyms). While semantic techniques like
unsupervised embeddings (e.g. word2vec/fastText) have shown good results in
word similarity tasks, we observed that they perform poorly at distinguishing
between close canonical forms, as these close forms often occur in similar
contexts. We propose to learn token embeddings using a twin network with
triplet loss. We propose an embedding learning task leveraging raw attribute
values and product titles to learn these embeddings in a self-supervised
fashion. We show that providing supervision using our proposed task improves
over both syntactic and unsupervised embedding-based techniques for attribute
normalization. Experiments on a real-world attribute normalization dataset of
50 attributes show that the embeddings trained using our proposed approach
obtain a 2.3% improvement over the best string matching and a 19.3% improvement over
the best unsupervised embeddings.
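As a rough illustration of the syntactic-matching baseline discussed in the abstract, the sketch below maps a raw attribute value to its closest canonical value using cosine similarity over token count vectors, with the Jaccard index included for comparison. The whitespace tokenization and the tiny canonical list are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of syntactic attribute normalization: pick the canonical
# value most similar to the raw surface form. Tokenization and the canonical
# list here are assumptions for illustration only.
from collections import Counter
import math


def tokens(s):
    return s.lower().split()


def cosine(a, b):
    # Cosine similarity over token count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    # Jaccard index over token sets.
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def normalize(raw_value, canonical_values, sim=cosine):
    # Return the canonical value with the highest similarity to the raw value.
    return max(canonical_values, key=lambda c: sim(raw_value, c))


canon = ["Windows 10", "Windows 8", "Mac OS"]
print(normalize("Win 10 Pro", canon, sim=cosine))   # -> "Windows 10"
print(normalize("Win 10 Pro", canon, sim=jaccard))  # -> "Windows 10"
```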
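The twin network with triplet loss can be sketched roughly as follows. This is a minimal sketch assuming a PyTorch implementation with a plain token-embedding table; the triplet construction (anchor and positive tokens drawn from the same product's title and attribute value, a negative token from a different product) is a simplification of the self-supervised task the abstract describes, not the paper's exact procedure.

```python
# Hedged sketch: learn token embeddings with a shared-weight ("twin") encoder
# trained by triplet loss. The triplet sampling shown in the dummy batch is a
# stand-in for the paper's self-supervised task over titles and attribute values.
import torch
import torch.nn as nn


class TokenEncoder(nn.Module):
    """Embeds a token id; the same weights encode anchor, positive and negative."""

    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        # L2-normalize so distances are comparable across tokens.
        return nn.functional.normalize(self.emb(token_ids), dim=-1)


def train_step(encoder, optimizer, anchor_ids, positive_ids, negative_ids, margin=0.2):
    """One optimization step on a batch of (anchor, positive, negative) token ids."""
    loss_fn = nn.TripletMarginLoss(margin=margin)
    loss = loss_fn(encoder(anchor_ids), encoder(positive_ids), encoder(negative_ids))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


vocab_size = 10000  # hypothetical vocabulary size
encoder = TokenEncoder(vocab_size)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
# Dummy batch of token ids standing in for real (anchor, positive, negative) triplets.
a = torch.randint(0, vocab_size, (32,))
p = torch.randint(0, vocab_size, (32,))
n = torch.randint(0, vocab_size, (32,))
print(train_step(encoder, optimizer, a, p, n))
```

At inference time, a raw value could then be mapped to the canonical value whose token embeddings lie closest, which is what lets synonyms such as "720p" and "HD" match despite having no string overlap.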
Related papers
- SimMatchV2: Semi-Supervised Learning with Graph Consistency [53.31681712576555]
We introduce a new semi-supervised learning algorithm - SimMatchV2.
It formulates various consistency regularizations between labeled and unlabeled data from the graph perspective.
SimMatchV2 has been validated on multiple semi-supervised learning benchmarks.
arXiv Detail & Related papers (2023-08-13T05:56:36Z)
- VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain [0.0]
The paper focuses on deep learning semantic search algorithms applied in the HR domain.
The aim of the article is to develop a novel approach to training a Siamese network to link the skills mentioned in the job ad with the title.
arXiv Detail & Related papers (2023-07-31T13:21:15Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on cross-device and cross-environment ASR, respectively.
arXiv Detail & Related papers (2021-04-15T14:36:54Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- CoMatch: Semi-supervised Learning with Contrastive Graph Regularization [86.84486065798735]
CoMatch is a new semi-supervised learning method that unifies dominant approaches.
It achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2020-11-23T02:54:57Z)
- UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection [5.099262949886174]
This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time.
We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings.
arXiv Detail & Related papers (2020-04-30T18:43:57Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.