LaTeX-Numeric: Language-agnostic Text attribute eXtraction for
E-commerce Numeric Attributes
- URL: http://arxiv.org/abs/2104.09576v1
- Date: Mon, 19 Apr 2021 19:14:32 GMT
- Title: LaTeX-Numeric: Language-agnostic Text attribute eXtraction for
E-commerce Numeric Attributes
- Authors: Kartik Mehta, Ioana Oprea and Nikhil Rasiwasia
- Abstract summary: We present a high-precision, fully automated, scalable framework for extracting E-commerce numeric attributes from product text.
We propose a multi-task architecture to handle missing labels in attribute data, yielding a 9.2% F1 improvement for numeric attributes over a single-task architecture.
We propose an automated algorithm for alias creation using attribute values, leading to a 20.2% F1 improvement.
- Score: 0.25782420501870296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present LaTeX-Numeric, a high-precision, fully automated, and scalable framework for extracting E-commerce numeric attributes from product text such as product descriptions. Most past work on attribute extraction is not scalable, as it relies on manually curated training data, with or without the use of active learning. We rely on distant supervision for training
data generation, removing the dependency on manual labels. One issue with
distant supervision is that it leads to incomplete training annotations due to
attribute values that are missing during matching. We propose a multi-task
learning architecture to handle missing labels in the training data, yielding a
9.2% F1 improvement for numeric attributes over a single-task architecture.
While the multi-task architecture benefits both numeric and non-numeric
attributes, we present automated techniques to further improve the numeric
attribute extraction models. Numeric attributes require a list of units (or aliases) for better
matching with distant supervision. We propose an automated algorithm for alias
creation using product text and attribute values, leading to a 20.2% F1
improvement. Extensive experiments on a real-world dataset covering 20 numeric
attributes across 5 product categories and 3 English marketplaces show that
LaTeX-Numeric achieves a high F1-score without any manual intervention, making
it suitable for practical applications. Finally, we show that the improvements
are language-agnostic, with LaTeX-Numeric achieving a 13.9% F1 improvement for 3
Romance languages.
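The distant-supervision step the abstract describes can be pictured as matching a known catalog value plus a list of unit aliases against the product text to produce token-level tags. A minimal sketch, assuming a BIO tagging scheme and an illustrative alias list (the attribute names, aliases, and function are hypothetical, not from the paper):

```python
# Hypothetical sketch of distant-supervision label generation for a
# numeric attribute. Alias lists and names are illustrative only.
UNIT_ALIASES = {"volume": ["ml", "milliliter", "millilitre", "l", "litre"]}

def distant_labels(text, attribute, value):
    """Tag tokens matching the known attribute value followed by a unit alias.

    Tokens stay 'O' when the catalog value never matches the text -- the
    incomplete-annotation problem the paper's multi-task architecture
    is designed to absorb.
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for alias in UNIT_ALIASES[attribute]:
        for i in range(len(tokens) - 1):
            if tokens[i] == value and tokens[i + 1].lower() == alias:
                tags[i] = "B-" + attribute.upper()
                tags[i + 1] = "I-" + attribute.upper()
    return list(zip(tokens, tags))

# Example: distant_labels("Shampoo bottle 500 ml family pack", "volume", "500")
# tags "500 ml" as B-VOLUME / I-VOLUME and leaves all other tokens as "O".
```

If the seller-provided value does not appear in the description, every token is labeled "O", so the resulting annotation is incomplete rather than wrong; richer alias lists (as produced by the paper's automated alias-creation algorithm) increase the match rate.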
Related papers
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
First, we present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce realFormula, a real-world dataset with MEs extracted from papers.
Third, we develop a MER model, MathNet, based on a convolutional vision transformer, achieving superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- Product Information Extraction using ChatGPT [69.12244027050454]
This paper explores the potential of ChatGPT for extracting attribute/value pairs from product descriptions.
Our results show that ChatGPT achieves a performance similar to a pre-trained language model but requires much smaller amounts of training data and computation for fine-tuning.
arXiv Detail & Related papers (2023-06-23T09:30:01Z)
- Large Scale Generative Multimodal Attribute Extraction for E-commerce Attributes [23.105116746332506]
E-commerce websites (e.g. Amazon) have a plethora of structured and unstructured information (text and images) present on the product pages.
Sellers often either fail to label or mislabel attribute values (e.g. color, size) of their products.
We present a scalable solution for this problem using MXT, consisting of three key components.
arXiv Detail & Related papers (2023-06-01T06:21:45Z)
- Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z)
- OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision [93.26737878221073]
We study the attribute mining problem in an open-world setting to extract novel attributes and their values.
We propose a principled framework that first generates attribute value candidates and then groups them into clusters of attributes.
Our model significantly outperforms strong baselines and can generalize to unseen attributes and product types.
arXiv Detail & Related papers (2022-04-29T04:16:04Z)
- AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities [0.0]
We present an end-to-end joint entity and relation extraction approach based on transformer-based language models.
In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction.
arXiv Detail & Related papers (2022-03-10T12:19:44Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.