Related papers: Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks

URL: http://arxiv.org/abs/2409.08647v1
Date: Fri, 13 Sep 2024 09:09:24 GMT
Title: Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks
Authors: Anita Eisenbürger, Daniel Otten, Anselm Hudde, Frank Hopfgartner
Abstract summary: This study aims to investigate the effects of label noise on gradient-boosted decision trees and methods to mitigate those effects. The implemented methods demonstrate state-of-the-art noise detection performance on the Adult dataset and achieve the highest classification precision and recall on the Adult and Breast Cancer datasets.
Score: 1.261491746208123
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Label noise refers to the phenomenon where instances in a data set are assigned to the wrong label. Label noise is harmful to classifier performance, increases model complexity and impairs feature selection. Addressing label noise is crucial, yet current research primarily focuses on image and text data using deep neural networks. This leaves a gap in the study of tabular data and gradient-boosted decision trees (GBDTs), the leading algorithm for tabular data. Different methods have already been developed which either try to filter label noise, model label noise while simultaneously training a classifier or use learning algorithms which remain effective even if label noise is present. This study aims to further investigate the effects of label noise on gradient-boosted decision trees and methods to mitigate those effects. Through comprehensive experiments and analysis, the implemented methods demonstrate state-of-the-art noise detection performance on the Adult dataset and achieve the highest classification precision and recall on the Adult and Breast Cancer datasets, respectively. In summary, this paper enhances the understanding of the impact of label noise on GBDTs and lays the groundwork for future research in noise detection and correction methods.

Related papers

Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection [84.78475642696137]
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. We propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS). SGPS constructs reliable positive pairs for noisy samples to enhance sample utilization.
arXiv Detail & Related papers (2025-01-19T14:41:55Z)
NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification [7.464154519547575]
Existing research on learning with noisy labels predominantly focuses on synthetic noise patterns. We constructed a benchmark dataset to better understand label noise in real-world text classification settings. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise.
arXiv Detail & Related papers (2024-07-09T06:18:40Z)
Extracting Clean and Balanced Subset for Noisy Long-tailed Classification [66.47809135771698]
We develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching. By setting a manually-specific probability measure, we can reduce the side-effects of noisy and long-tailed data simultaneously. Our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
arXiv Detail & Related papers (2024-04-10T07:34:37Z)
Group Benefits Instances Selection for Data Purification [21.977432359384835]
Existing methods for combating label noise are typically designed and tested on synthetic datasets. We propose a method named GRIP to alleviate the noisy label problem for both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-23T03:06:19Z)
SoftPatch: Unsupervised Anomaly Detection with Noisy Data [67.38948127630644]
This paper considers label-level noise in image sensory anomaly detection for the first time. We propose a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset.
arXiv Detail & Related papers (2024-03-21T08:49:34Z)
Handling Realistic Label Noise in BERT Text Classification [1.0515439489916731]
Real label noise is not random; rather, it is often correlated with input features or other annotator-specific factors. We show that the presence of these types of noise significantly degrades BERT classification performance.
arXiv Detail & Related papers (2023-05-23T18:30:31Z)
Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features [43.41573458276422]
We introduce a novel learning-based solution, leveraging a noise detector, instanced by an LSTM network. The proposed method trains the noise detector in a supervised manner using the dataset with synthesized label noises. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation.
arXiv Detail & Related papers (2022-12-19T09:39:30Z)
Noisy Label Classification using Label Noise Selection with Test-Time Augmentation Cross-Entropy and NoiseMix Learning [22.02829139522153]
We propose a method of learning noisy label data using the label noise selection with test-time augmentation (TTA) cross-entropy and classifier learning with the NoiseMix method. In experiments on the ISIC-18 public skin lesion diagnosis dataset, the proposed TTA cross-entropy outperformed the conventional cross-entropy and the TTA uncertainty in detecting label noise data.
arXiv Detail & Related papers (2022-12-01T13:05:20Z)
Towards Harnessing Feature Embedding for Robust Learning with Noisy Labels [44.133307197696446]
The memorization effect of deep neural networks (DNNs) plays a pivotal role in recent label noise learning methods. We propose a novel feature embedding-based method for deep learning with label noise, termed LabEl NoiseDilution (LEND).
arXiv Detail & Related papers (2022-06-27T02:45:09Z)
Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile [78.1212767880785]
The meta-learner is prone to overfitting since there are only a few available samples. When handling data with noisy labels, the meta-learner could be extremely sensitive to label noise. We present Eigen-Reptile (ER), which updates the meta-parameters with the main direction of historical task-specific parameters.
arXiv Detail & Related papers (2022-06-04T08:48:02Z)
Denoising Distantly Supervised Named Entity Recognition via a Hypergeometric Probabilistic Model [26.76830553508229]
Hypergeometric Learning (HGL) is a denoising algorithm for distantly supervised named entity recognition. HGL takes both noise distribution and instance-level confidence into consideration. Experiments show that HGL can effectively denoise the weakly-labeled data retrieved from distant supervision.
arXiv Detail & Related papers (2021-06-17T04:01:25Z)
Training Classifiers that are Universally Robust to All Label Noise Levels [91.13870793906968]
Deep neural networks are prone to overfitting in the presence of label noise. We propose a distillation-based framework that incorporates a new subcategory of Positive-Unlabeled learning. Our framework generally outperforms at medium to high noise levels.
arXiv Detail & Related papers (2021-05-27T13:49:31Z)
Noise-resistant Deep Metric Learning with Ranking-based Instance Selection [59.286567680389766]
We propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM). PRISM identifies noisy data in a minibatch using average similarity against image features extracted from several previous versions of the neural network. To alleviate the high computational cost brought by the memory bank, we introduce an acceleration method that replaces individual data points with the class centers.
arXiv Detail & Related papers (2021-03-30T03:22:17Z)
Improving Medical Image Classification with Label Noise Using Dual-uncertainty Estimation [72.0276067144762]
We discuss and define the two common types of label noise in medical images. We propose an uncertainty estimation-based framework to handle these two types of label noise in the medical image classification task.
arXiv Detail & Related papers (2021-02-28T14:56:45Z)
Towards Robustness to Label Noise in Text Classification via Noise Modeling [7.863638253070439]
Large datasets in NLP suffer from noisy labels, due to erroneous automatic and human annotation procedures. We study the problem of text classification with label noise, and aim to capture this noise through an auxiliary noise model over the classifier.
arXiv Detail & Related papers (2021-01-27T05:41:57Z)
Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances. Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
Rectified Meta-Learning from Noisy Labels for Robust Image-based Plant Disease Diagnosis [64.82680813427054]
Plant diseases are one of the main threats to food security and crop production. One popular approach is to cast this problem as a leaf image classification task, which can be addressed by powerful convolutional neural networks (CNNs). We propose a novel framework that incorporates a rectified meta-learning module into the common CNN paradigm to train a noise-robust deep network without using extra supervision information.
arXiv Detail & Related papers (2020-03-17T09:51:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.