From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning
- URL: http://arxiv.org/abs/2408.08981v2
- Date: Thu, 22 Aug 2024 17:20:27 GMT
- Title: From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning
- Authors: Ranran Haoran Zhang, Bensu Uçar, Soumik Dey, Hansi Wu, Binbin Li, Rui Zhang,
- Abstract summary: Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond a $103$ to $1012$ label set.
Self-selection bias in data annotation leads to significant missing labels in both training and test data.
We introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task.
- Score: 5.979699034554517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be "lazy'" by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model's laziness. Additionally, we propose to adopt a suite of evaluation metrics, F1@$\mathcal{O}$ and newly proposed B@$k$, to reliably assess OXMC models with incomplete ground truths. In a highly imbalanced e-commerce dataset with substantial missing labels, PUSL generates 30% more unique labels, and 72% of its predictions align with actual user queries. On the less skewed EURLex-4.3k dataset, PUSL demonstrates superior F1 scores, especially as label counts increase from 15 to 30. Our approach effectively tackles both the modeling and evaluation challenges in OXMC with missing labels.
Related papers
- Learning label-label correlations in Extreme Multi-label Classification via Label Features [44.00852282861121]
Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices.
Short-text XMC with label features has found numerous applications in areas such as query-to-ad-phrase matching in search ads, title-based product recommendation, prediction of related searches.
We propose Gandalf, a novel approach which makes use of a label co-occurrence graph to leverage label features as additional data points to supplement the training distribution.
arXiv Detail & Related papers (2024-05-03T21:18:43Z) - Label-Retrieval-Augmented Diffusion Models for Learning from Noisy
Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z) - Dist-PU: Positive-Unlabeled Learning from a Label Distribution
Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z) - Ground Truth Inference for Weakly Supervised Entity Matching [76.6732856489872]
We propose a simple but powerful labeling model for weak supervision tasks.
We then tailor the labeling model specifically to the task of entity matching.
We show that our labeling model results in a 9% higher F1 score on average than the best existing method.
arXiv Detail & Related papers (2022-11-13T17:57:07Z) - Open Vocabulary Extreme Classification Using Generative Models [24.17018785195843]
The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set.
We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss independent of predicted label order.
We show the efficacy of the approach, experimenting with popular XMC datasets for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.
arXiv Detail & Related papers (2022-05-12T00:33:49Z) - Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL)
CMPL achieves $17.6%$ and $25.1%$ Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and $1%$ labeled data, respectively.
arXiv Detail & Related papers (2021-12-17T18:59:41Z) - Extreme Zero-Shot Learning for Extreme Text Classification [80.95271050744624]
Extreme Zero-Shot XMC (EZ-XMC) and Few-Shot XMC (FS-XMC) are investigated.
We propose to pre-train Transformer-based encoders with self-supervised contrastive losses.
We develop a pre-training method MACLR, which thoroughly leverages the raw text with techniques including Multi-scale Adaptive Clustering, Label Regularization, and self-training with pseudo positive pairs.
arXiv Detail & Related papers (2021-12-16T06:06:42Z) - LightXML: Transformer with Dynamic Negative Sampling for
High-Performance Extreme Multi-label Text Classification [27.80266694835677]
Extreme Multi-label text Classification (XMC) is a task of finding the most relevant labels from a large label set.
We propose LightXML, which adopts end-to-end training and dynamic negative labels sampling.
In experiments, LightXML outperforms state-of-the-art methods in five extreme multi-label datasets.
arXiv Detail & Related papers (2021-01-09T07:04:18Z) - A Study on the Autoregressive and non-Autoregressive Multi-label
Learning [77.11075863067131]
We propose a self-attention based variational encoder-model to extract the label-label and label-feature dependencies jointly.
Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies.
arXiv Detail & Related papers (2020-12-03T05:41:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.