Related papers: Imputation using training labels and classification via label imputation

Imputation using training labels and classification via label imputation

URL: http://arxiv.org/abs/2311.16877v4
Date: Fri, 25 Oct 2024 06:44:08 GMT
Title: Imputation using training labels and classification via label imputation
Authors: Thu Nguyen, Tuan L. Vo, Pål Halvorsen, Michael A. Riegler,
Abstract summary: We propose Classification Based on MissForest Imputation to deal with missing data. CBMI stacks the predicted test label with missing values and stacks the label with the input for imputation. CBMI consistently shows significantly better results than imputation based on only the input data.
Score: 4.387724419358174
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Missing data is a common problem in practical data science settings. Various imputation methods have been developed to deal with missing data. However, even though the labels are available in the training data in many situations, the common practice of imputation usually only relies on the input and ignores the label. We propose Classification Based on MissForest Imputation (CBMI), a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation, allowing the label and the input to be imputed simultaneously. In addition, we propose the imputation using labels (IUL) algorithm, an imputation strategy that stacks the label into the input and illustrates how it can significantly improve the imputation quality. Experiments show that CBMI has classification accuracy when the test set contains missing data, especially for imbalanced data and categorical data. Moreover, for both the regression and classification, IUL consistently shows significantly better results than imputation based on only the input data.

Related papers

Evaluating multiple models using labeled and unlabeled data [8.174722982389259]
Semi-Supervised Model Evaluation (SSME) is a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method.
arXiv Detail & Related papers (2025-01-21T03:47:37Z)
Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground truth estimation methods. Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z)
Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training. We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data. Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
Partial-Label Regression [54.74984751371617]
Partial-label learning is a weakly supervised learning setting that allows each training example to be annotated with a set of candidate labels. Previous studies on partial-label learning only focused on the classification setting where candidate labels are all discrete. In this paper, we provide the first attempt to investigate partial-label regression, where each training example is annotated with a set of real-valued candidate labels.
arXiv Detail & Related papers (2023-06-15T09:02:24Z)
Detecting Label Errors in Token Classification Data [22.539748563923123]
We consider the task of finding sentences that contain label errors in token classification datasets. We study 11 different straightforward methods that score tokens/sentences based on the predicted class probabilities. We identify a simple and effective method that consistently detects those sentences containing label errors when applied with different token classification models.
arXiv Detail & Related papers (2022-10-08T05:14:22Z)
Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets. To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data. We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
Instance Correction for Learning with Open-set Noisy Labels [145.06552420999986]
We use the sample selection approach to handle open-set noisy labels. The discarded data are seen to be mislabeled and do not participate in training. We modify the instances of discarded data to make predictions for the discarded data consistent with given labels.
arXiv Detail & Related papers (2021-06-01T13:05:55Z)
Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z)
Harmless label noise and informative soft-labels in supervised classification [1.6752182911522517]
Manual labelling of training examples is common practice in supervised learning. When the labelling task is of non-trivial difficulty, the supplied labels may not be equal to the ground-truth labels, and label noise is introduced into the training dataset. In particular, when classification difficulty is the only source of label errors, multiple sets of noisy labels can supply more information for the estimation of a classification rule.
arXiv Detail & Related papers (2021-04-07T02:56:11Z)
Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning. The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.