LaTeX-Numeric: Language-agnostic Text attribute eXtraction for
E-commerce Numeric Attributes
- URL: http://arxiv.org/abs/2104.09576v1
- Date: Mon, 19 Apr 2021 19:14:32 GMT
- Title: LaTeX-Numeric: Language-agnostic Text attribute eXtraction for
E-commerce Numeric Attributes
- Authors: Kartik Mehta, Ioana Oprea and Nikhil Rasiwasia
- Abstract summary: We present a high-precision, fully automated, scalable framework for extracting E-commerce numeric attributes from product text.
We propose a multi-task architecture to deal with missing labels in attribute data, leading to an F1 improvement of 9.2% for numeric attributes over a single-task architecture.
We propose an automated algorithm for alias creation using attribute values, leading to a 20.2% F1 improvement.
- Score: 0.25782420501870296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present LaTeX-Numeric, a high-precision, fully automated,
scalable framework for extracting E-commerce numeric attributes from product
text such as product descriptions. Most past work on attribute extraction is
not scalable because it relies on manually curated training data, either with or
without the use of active learning. We rely on distant supervision for training
data generation, removing dependency on manual labels. One issue with distant
supervision is that it leads to incomplete training annotation due to missing
attribute values while matching. We propose a multi-task learning architecture
to deal with missing labels in the training data, leading to an F1 improvement of
9.2% for numeric attributes over a single-task architecture. While the multi-task
architecture benefits both numeric and non-numeric attributes, we present
automated techniques to further improve the numeric attribute extraction
models. Numeric attributes require a list of units (or aliases) for better
matching with distant supervision. We propose an automated algorithm for alias
creation using product text and attribute values, leading to a 20.2% F1
improvement. Extensive experiments on a real-world dataset for 20 numeric
attributes across 5 product categories and 3 English marketplaces show that
LaTeX-Numeric achieves a high F1-score, without any manual intervention, making
it suitable for practical applications. Finally, we show that the improvements
are language-agnostic and LaTeX-Numeric achieves a 13.9% F1 improvement for 3
Romance languages.
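The distant-supervision step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `distant_labels`, the tokenizer, and the B/I/O tagging scheme are all assumptions made for illustration. It tags a number in the product text only when it equals the catalog attribute value and is followed by a known unit alias.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

def distant_labels(text, value, aliases):
    """Hypothetical helper: tag tokens B/I/O where `<value> <alias>` occurs.

    text    -- product text, e.g. a description
    value   -- numeric attribute value taken from the catalog
    aliases -- unit strings accepted for this attribute (assumed given)
    """
    # Split into numbers, alphabetic words, and single punctuation marks,
    # so that "500g" becomes the two tokens "500" and "g".
    tokens = re.findall(NUMBER + r"|[A-Za-z]+|\S", text)
    tags = ["O"] * len(tokens)
    alias_set = {a.lower() for a in aliases}
    for i in range(len(tokens) - 1):
        is_number = re.fullmatch(NUMBER, tokens[i]) is not None
        if (is_number
                and float(tokens[i]) == float(value)
                and tokens[i + 1].lower() in alias_set):
            tags[i] = "B"       # beginning of the attribute span
            tags[i + 1] = "I"   # unit token inside the span
    return list(zip(tokens, tags))

labeled = distant_labels("Net weight 500g, pack of 2", 500, ["g", "grams"])
```

If the catalog value is missing or no alias matches, every token stays tagged `O`, which is the incomplete-annotation problem the abstract attributes to distant supervision.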
Related papers
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale.
Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements.
We construct an AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
- Self-Refinement Strategies for LLM-based Product Attribute Value Extraction [51.45146101802871]
This paper investigates applying two self-refinement techniques to the product attribute value extraction task.
The experiments show that both self-refinement techniques fail to significantly improve the extraction performance while substantially increasing processing costs.
For scenarios with development data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the amount of product descriptions increases.
arXiv Detail & Related papers (2025-01-02T12:55:27Z)
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts [13.789739307267952]
We present Autonomous Data Selection (AutoDS), a method that automatically curates high-quality mathematical texts.
Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits.
We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation.
arXiv Detail & Related papers (2024-02-12T13:09:21Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- Product Information Extraction using ChatGPT [69.12244027050454]
This paper explores the potential of ChatGPT for extracting attribute/value pairs from product descriptions.
Our results show that ChatGPT achieves a performance similar to a pre-trained language model but requires much smaller amounts of training data and computation for fine-tuning.
arXiv Detail & Related papers (2023-06-23T09:30:01Z)
- Large Scale Generative Multimodal Attribute Extraction for E-commerce Attributes [23.105116746332506]
E-commerce websites (e.g. Amazon) have a plethora of structured and unstructured information (text and images) present on the product pages.
Sellers often either don't label or mislabel values of the attributes (e.g. color, size etc.) for their products.
We present a scalable solution for this problem using MXT, consisting of three key components.
arXiv Detail & Related papers (2023-06-01T06:21:45Z)
- Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z)
- AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities [0.0]
We present an end-to-end joint entity and relation extraction approach based on transformer-based language models.
In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction.
arXiv Detail & Related papers (2022-03-10T12:19:44Z)