GenToC: Leveraging Partially-Labeled Data for Product Attribute-Value Identification
- URL: http://arxiv.org/abs/2405.10918v1
- Date: Fri, 17 May 2024 17:09:45 GMT
- Title: GenToC: Leveraging Partially-Labeled Data for Product Attribute-Value Identification
- Authors: D. Subhalingam, Keshav Kolluru, Mausam, Saurabh Singal,
- Abstract summary: GenToC is a novel model for extracting attribute-value pairs from product titles.
It is successfully integrated into India's largest B2B e-commerce platform, IndiaMART.com.
- Score: 23.72090728600973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the e-commerce domain, the accurate extraction of attribute-value pairs from product listings (e.g., Brand: Apple) is crucial for enhancing search and recommendation systems. The automation of this extraction process is challenging due to the vast diversity of product categories and their respective attributes, compounded by the lack of extensive, accurately annotated training datasets and the demand for low latency to meet the real-time needs of e-commerce platforms. To address these challenges, we introduce GenToC, a novel two-stage model for extracting attribute-value pairs from product titles. GenToC is designed to train with partially-labeled data, leveraging incomplete attribute-value pairs and obviating the need for a fully annotated dataset. Moreover, we introduce a bootstrapping method that enables GenToC to progressively refine and expand its training dataset. This enhancement substantially improves the quality of data available for training other neural network models that are typically faster but are inherently less capable than GenToC in terms of their capacity to handle partially-labeled data. By supplying an enriched dataset for training, GenToC significantly advances the performance of these alternative models, making them more suitable for real-time deployment. Our results highlight the unique capability of GenToC to learn from a limited set of labeled data and to contribute to the training of more efficient models, marking a significant leap forward in the automated extraction of attribute-value pairs from product titles. GenToC has been successfully integrated into India's largest B2B e-commerce platform, IndiaMART.com, achieving a significant increase of 21.1% in recall over the existing deployed system while maintaining a high precision of 89.5% in this challenging task.
Related papers
- Dataset Distillation via Committee Voting [21.018818924580877]
We introduce $bf C$ommittee $bf V$oting for $bf D$ataset $bf D$istillation (CV-DD)
CV-DD is a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
arXiv Detail & Related papers (2025-01-13T18:59:48Z) - A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets [0.0]
This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data.
We use the records of demands to a Brazilian Public Prosecutor's Office aiming to assign the descriptions in one of the subjects.
The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and strategies of semi-supervised learning.
arXiv Detail & Related papers (2024-09-09T18:10:05Z) - Predictive Analytics of Varieties of Potatoes [2.336821989135698]
We explore the application of machine learning algorithms specifically to enhance the selection process of Russet potato clones in breeding trials.
This study addresses the challenge of efficiently identifying high-yield, disease-resistant, and climate-resilient potato varieties.
arXiv Detail & Related papers (2024-04-04T00:49:05Z) - Enhanced E-Commerce Attribute Extraction: Innovating with Decorative
Relation Correction and LLAMA 2.0-Based Annotation [4.81846973621209]
We propose a pioneering framework that integrates BERT for classification, a Conditional Random Fields (CRFs) layer for attribute value extraction, and Large Language Models (LLMs) for data annotation.
Our approach capitalizes on the robust representation learning of BERT, synergized with the sequence decoding prowess of CRFs, to adeptly identify and extract attribute values.
Our methodology is rigorously validated on various datasets, including Walmart, BestBuy's e-commerce NER dataset, and the CoNLL dataset.
arXiv Detail & Related papers (2023-12-09T08:26:30Z) - JPAVE: A Generation and Classification-based Model for Joint Product
Attribute Prediction and Value Extraction [59.94977231327573]
We propose a multi-task learning model with value generation/classification and attribute prediction called JPAVE.
Two variants of our model are designed for open-world and closed-world scenarios.
Experimental results on a public dataset demonstrate the superiority of our model compared with strong baselines.
arXiv Detail & Related papers (2023-11-07T18:36:16Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - Benchmarking and Analyzing Generative Data for Visual Recognition [95.69499648941196]
This work delves into the impact of generative images, primarily comparing paradigms that harness external data.<n>We devise textbfGenBench, a benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks.<n>Our exhaustive benchmark and analysis spotlight generative data's promise in visual recognition, while identifying key challenges for future investigation.
arXiv Detail & Related papers (2023-07-25T17:59:59Z) - Learning brain MRI quality control: a multi-factorial generalization
problem [0.0]
This work aimed at evaluating the performances of the MRIQC pipeline on various large-scale datasets.
We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them.
We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data.
arXiv Detail & Related papers (2022-05-31T15:46:44Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - Towards Open-World Feature Extrapolation: An Inductive Graph Learning
Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Ego2Hands: A Dataset for Egocentric Two-hand Segmentation and Detection [1.0742675209112622]
We present Ego2Hands, a large-scale RGB-based egocentric hand segmentation/detection dataset that is semi-automatically annotated.
For quantitative analysis, we manually annotated an evaluation set that significantly exceeds existing benchmarks in quantity, diversity and annotation accuracy.
arXiv Detail & Related papers (2020-11-14T10:12:35Z) - Joint Item Recommendation and Attribute Inference: An Adaptive Graph
Convolutional Network Approach [61.2786065744784]
In recommender systems, users and items are associated with attributes, and users show preferences to items.
As annotating user (item) attributes is a labor intensive task, the attribute values are often incomplete with many missing attribute values.
We propose an Adaptive Graph Convolutional Network (AGCN) approach for joint item recommendation and attribute inference.
arXiv Detail & Related papers (2020-05-25T10:50:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.