Characterizing instance hardness in classification and regression
problems
- URL: http://arxiv.org/abs/2212.01897v1
- Date: Sun, 4 Dec 2022 19:16:43 GMT
- Title: Characterizing instance hardness in classification and regression
problems
- Authors: Gustavo P. Torquette and Victor S. Nunes and Pedro Y. A. Paiva and
Lourenço B. C. Neto and Ana C. Lorena
- Abstract summary: This paper presents a set of meta-features that aim at characterizing which instances of a dataset are hardest to have their label predicted accurately.
Both classification and regression problems are considered.
A Python package containing all implementations is also provided.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Some recent pieces of work in the Machine Learning (ML) literature have
demonstrated the usefulness of assessing which observations are hardest to have
their label predicted accurately. By identifying such instances, one may
inspect whether they have any quality issues that should be addressed. Learning
strategies based on the difficulty level of the observations can also be
devised. This paper presents a set of meta-features that aim at characterizing
which instances of a dataset are hardest to have their label predicted
accurately and why they are so, aka instance hardness measures. Both
classification and regression problems are considered. Synthetic datasets with
different levels of complexity are built and analyzed. A Python package
containing all implementations is also provided.
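The abstract does not enumerate the measures themselves, but a classical example of an instance hardness measure for classification is k-Disagreeing Neighbors (kDN): the fraction of an instance's k nearest neighbors whose label differs from its own. The sketch below is an illustrative NumPy/scikit-learn rendering of that idea, not the API of the authors' Python package; the function name and the choice k=5 are assumptions made for this example.

```python
# Illustrative sketch of one classical instance hardness measure,
# k-Disagreeing Neighbors (kDN). Higher values indicate harder instances.
# NOT the authors' package API; names and k=5 are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_disagreeing_neighbors(X, y, k=5):
    """Fraction of each instance's k nearest neighbors whose label
    disagrees with the instance's own label (higher = harder)."""
    X, y = np.asarray(X), np.asarray(y)
    # k + 1 neighbors because each query point is returned as its own
    # nearest neighbor and must be discarded.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]               # drop the point itself
    return (neighbor_labels != y[:, None]).mean(axis=1)

# Toy usage: instances near the class boundary should score highest.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
hardness = k_disagreeing_neighbors(X, y)
print("five hardest instances:", np.argsort(hardness)[-5:])
```

An analogous neighborhood-based measure can be defined for regression by comparing an instance's target value against the targets of its nearest neighbors, in line with the paper's goal of covering both problem types.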
Related papers
- Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
arXiv Detail & Related papers (2023-02-15T09:18:46Z)
- HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques [48.82319198853359]
HardVis is a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios.
Users can explore subsets of the data from different perspectives to decide on the undersampling and oversampling parameters.
The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case.
arXiv Detail & Related papers (2022-03-29T17:04:16Z)
- Robust Deep Semi-Supervised Learning: A Brief Introduction [63.09703308309176]
Semi-supervised learning (SSL) aims to improve learning performance by leveraging unlabeled data when labels are insufficient.
SSL with deep models has proven to be successful on standard benchmark tasks.
However, these models are still vulnerable to various robustness threats in real-world applications.
arXiv Detail & Related papers (2022-02-12T04:16:41Z)
- PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset, relating each instance to the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
Using a COVID prognosis dataset, we show how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
- Learning to Aggregate and Refine Noisy Labels for Visual Sentiment Analysis [69.48582264712854]
We propose a robust learning method for visual sentiment analysis.
Our method relies on an external memory to aggregate and filter noisy labels during training.
We establish a benchmark for visual sentiment analysis with label noise using publicly available datasets.
arXiv Detail & Related papers (2021-09-15T18:18:28Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning? [53.523017945443115]
We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
arXiv Detail & Related papers (2020-12-11T15:25:14Z)
- Geometry matters: Exploring language examples at the decision boundary [2.7249290070320034]
BERT, CNN, and fastText models are susceptible to word substitutions in high-difficulty examples.
On YelpReviewPolarity we observe a correlation coefficient of -0.4 between resilience to perturbations and the difficulty score.
Our approach is simple, architecture agnostic and can be used to study the fragilities of text classification models.
arXiv Detail & Related papers (2020-10-14T16:26:13Z)
- Analysis of label noise in graph-based semi-supervised learning [2.4366811507669124]
In machine learning, one must acquire labels to help supervise a model that will be able to generalize to unseen data.
It is often the case that most of our data is unlabeled.
Semi-supervised learning (SSL) alleviates that by making strong assumptions about the relation between the labels and the input data distribution.
arXiv Detail & Related papers (2020-09-27T22:13:20Z)
- Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect [9.666866159867444]
This research work focuses on revisiting complexity metrics based on data morphology.
Because they are based on ball coverage by classes, they are named Overlap Number of Balls metrics.
arXiv Detail & Related papers (2020-07-15T18:21:13Z)
- Structured Prediction with Partial Labelling through the Infimum Loss [85.4940853372503]
The goal of weak supervision is to enable models to learn using only forms of labelling which are cheaper to collect.
Partial labelling is a type of incomplete annotation where, for each datapoint, supervision is cast as a set of labels containing the real one.
This paper provides a unified framework based on structured prediction and on the concept of infimum loss to deal with partial labelling.
arXiv Detail & Related papers (2020-03-02T13:59:41Z)
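As a hedged illustration of the infimum-loss idea summarized in the last entry above: when the supervision for a datapoint is a set of candidate labels containing the true one, the loss charged to a prediction is the smallest base loss over the candidates. The snippet below is a minimal sketch of that concept with a cross-entropy base loss; the function name and setup are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the infimum (minimum) loss for partial labelling:
# each training point comes with a SET of candidate labels containing the
# true one, and the loss is the smallest base loss over that set.
# Illustrative only; the cross-entropy base loss and the function name
# are assumptions, not the paper's implementation.
import numpy as np

def infimum_cross_entropy(probs, candidate_sets):
    """probs: (n, c) predicted class probabilities.
    candidate_sets: one list of candidate class indices per point.
    Returns the per-point infimum loss."""
    losses = []
    for p, candidates in zip(probs, candidate_sets):
        # Cross-entropy against each candidate label; keep the smallest.
        losses.append(min(-np.log(p[c] + 1e-12) for c in candidates))
    return np.array(losses)

# Example: 3 classes, two points; the second point's label is ambiguous
# between classes 0 and 2, so the loss credits the better-matching one.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
candidate_sets = [[0], [0, 2]]
print(infimum_cross_entropy(probs, candidate_sets))
```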