Improve Cost Efficiency of Active Learning over Noisy Dataset
- URL: http://arxiv.org/abs/2403.01346v1
- Date: Sat, 2 Mar 2024 23:53:24 GMT
- Title: Improve Cost Efficiency of Active Learning over Noisy Dataset
- Authors: Zan-Kai Chong, Hiroyuki Ohsaki, and Bryan Ng
- Abstract summary: In this paper, we consider cases of binary classification, where acquiring a positive instance incurs a significantly higher cost compared to that of negative instances.
We propose a shifted normal distribution sampling function that samples from a wider range than typical uncertainty sampling.
Our simulation underscores that our proposed sampling function limits both noisy and positive label selection, delivering between 20% and 32% improved cost efficiency over different test datasets.
- Score: 1.3846014191157405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active learning is a learning strategy whereby the machine learning algorithm
actively identifies and labels data points to optimize its learning. This
strategy is particularly effective in domains where an abundance of unlabeled
data exists, but the cost of labeling these data points is prohibitively
expensive. In this paper, we consider cases of binary classification, where
acquiring a positive instance incurs a significantly higher cost compared to
that of negative instances. For example, in the financial industry, such as in
money-lending businesses, a defaulted loan constitutes a positive event leading
to substantial financial loss. To address this issue, we propose a shifted
normal distribution sampling function that samples from a wider range than
typical uncertainty sampling. Our simulation underscores that our proposed
sampling function limits both noisy and positive label selection, delivering
between 20% and 32% improved cost efficiency over different test datasets.
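The following is a minimal sketch of how such a sampling function could be wired into a pool-based query step, assuming the model produces a probability of the (expensive) positive class for every unlabeled instance. The mean, standard deviation, and helper names below are illustrative assumptions, not the paper's reported settings.

```python
# Sketch: weight candidates by a normal density whose mean is shifted toward
# the negative side and whose spread is wider than the sharp peak (p ~ 0.5)
# implied by classical uncertainty sampling, then sample a batch accordingly.
import numpy as np

def shifted_normal_weights(p_pos, mu=0.35, sigma=0.15):
    """Acquisition weight per unlabeled instance.

    p_pos : array of model-estimated P(y = 1) for the unlabeled pool.
    mu    : centre of the sampling window, shifted below 0.5 so fewer
            costly positives are selected (value chosen for illustration).
    sigma : width of the window, wider than plain uncertainty sampling so
            noisy borderline points do not dominate the batch.
    """
    return np.exp(-0.5 * ((p_pos - mu) / sigma) ** 2)

def select_batch(p_pos, batch_size, seed=None):
    """Sample a query batch with probability proportional to the weights."""
    rng = np.random.default_rng(seed)
    w = shifted_normal_weights(p_pos)
    return rng.choice(len(p_pos), size=batch_size, replace=False, p=w / w.sum())

# Example: query 10 points from a pool of 1000 candidate probabilities.
pool_probs = np.random.default_rng(0).uniform(0, 1, size=1000)
query_idx = select_batch(pool_probs, batch_size=10, seed=1)
```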
Related papers
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
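As a hedged illustration of the amortization idea (not the authors' implementation), the sketch below trains a small regression model on cheap, noisy, but unbiased stand-in targets; because squared-error regression fits the conditional mean, the amortized model approaches the clean target without ever computing it exactly. The toy data and names are assumptions.

```python
# Train an "amortized explainer" on noisy one-sample estimates of an expensive
# per-feature score; a single forward pass then replaces many Monte Carlo runs.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))                                   # inputs to explain
clean_target = X * np.array([3., 1., 0., 0., 2., 0., 1., 0.])    # stand-in "true" scores
noisy_target = clean_target + rng.normal(scale=2.0, size=clean_target.shape)  # unbiased noise

# The amortized model sees only the noisy labels during training.
amortizer = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
amortizer.fit(X, noisy_target)

pred = amortizer.predict(X[:5])
print(np.abs(pred - clean_target[:5]).mean())   # error stays modest despite heavy label noise
```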
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
- Fair Active Learning in Low-Data Regimes [22.349886628823125]
In machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities.
In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments.
We introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine.
We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability.
arXiv Detail & Related papers (2023-12-13T23:14:55Z)
- Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection [0.8192907805418583]
We present ways of assessing prospective training examples for their "usefulness" or "difficulty".
We primarily experiment with entropy and Error L2-Norm (EL2N) scores.
We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate compared to the baseline of random selection.
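For concreteness, here is a small sketch of the two scores under their usual definitions: predictive entropy of the softmax distribution, and EL2N as the L2 norm of the difference between the softmax probabilities and the one-hot label. Higher scores flag harder, potentially more useful examples; the toy arrays are assumptions.

```python
import numpy as np

def entropy_score(probs, eps=1e-12):
    """probs: (n, num_classes) softmax outputs; higher = more uncertain."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def el2n_score(probs, labels, num_classes):
    """EL2N: L2 norm of (softmax probabilities - one-hot label)."""
    one_hot = np.eye(num_classes)[labels]
    return np.linalg.norm(probs - one_hot, axis=1)

# Toy usage: rank a pool of examples by either score and keep the top-k.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33]])
labels = np.array([0, 1, 2])
print(entropy_score(probs), el2n_score(probs, labels, num_classes=3))
```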
arXiv Detail & Related papers (2023-11-27T20:33:54Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
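A hedged sketch of the weighting idea: rather than a hard confidence threshold that discards low-confidence pseudo-labels, every unlabeled sample keeps a soft weight from a truncated Gaussian of its confidence, centred on the running mean confidence. The constants and the simplified statistics below are assumptions, not the paper's exact recipe.

```python
import numpy as np

def soft_weights(max_probs, mean_conf, var_conf):
    """max_probs: (n,) top-1 softmax confidence per unlabeled sample."""
    w = np.exp(-((max_probs - mean_conf) ** 2) / (2.0 * var_conf))
    return np.where(max_probs >= mean_conf, 1.0, w)   # full weight above the mean

# The mean/variance of confidence would normally be tracked with an EMA across batches.
max_probs = np.array([0.97, 0.81, 0.55, 0.42])
mean_conf, var_conf = max_probs.mean(), max_probs.var() + 1e-8
print(soft_weights(max_probs, mean_conf, var_conf))
```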
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- Deep Active Learning with Budget Annotation [0.0]
We propose a hybrid approach that computes both the uncertainty and the informativeness of an instance.
We employ state-of-the-art pre-trained models in order to avoid querying information already contained in those models.
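One plausible way to combine the two signals (a sketch under stated assumptions, not the paper's method): uncertainty from the pre-trained model's predictive entropy, and informativeness as distance in the pre-trained embedding space from the already-labeled set, mixed with a tunable weight.

```python
import numpy as np

def hybrid_scores(probs, pool_emb, labeled_emb, alpha=0.5, eps=1e-12):
    """probs: (n, c) pre-trained model softmax on the unlabeled pool.
    pool_emb: (n, d), labeled_emb: (m, d) embeddings from the same model."""
    uncertainty = -np.sum(probs * np.log(probs + eps), axis=1)
    # informativeness: distance to the nearest already-labeled embedding
    dists = np.linalg.norm(pool_emb[:, None, :] - labeled_emb[None, :, :], axis=2)
    informativeness = dists.min(axis=1)
    # normalize both signals to [0, 1] before mixing
    u = (uncertainty - uncertainty.min()) / (np.ptp(uncertainty) + eps)
    i = (informativeness - informativeness.min()) / (np.ptp(informativeness) + eps)
    return alpha * u + (1 - alpha) * i

# Toy usage with a 3-point pool and 2 labeled embeddings; query the top scorer.
probs = np.array([[0.60, 0.40], [0.95, 0.05], [0.50, 0.50]])
pool_emb = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.2]])
labeled_emb = np.array([[0.0, 0.1], [0.2, 0.0]])
print(hybrid_scores(probs, pool_emb, labeled_emb))
```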
arXiv Detail & Related papers (2022-07-31T20:20:44Z)
- ALLSH: Active Learning Guided by Local Sensitivity and Hardness [98.61023158378407]
We propose to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function.
Our method achieves consistent gains over the commonly used active learning strategies in various classification tasks.
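A simplified sensitivity-style acquisition sketch, assuming each pool example has a locally perturbed (augmented or paraphrased) copy: query the examples whose predictions change the most between the two. The symmetric KL divergence used here is an assumption; the paper's exact divergence and hardness term may differ.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Row-wise KL divergence between two (n, c) probability arrays."""
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=1)

def local_sensitivity(probs_orig, probs_aug):
    """Symmetric KL between predictions on originals and their perturbed copies."""
    return 0.5 * (kl(probs_orig, probs_aug) + kl(probs_aug, probs_orig))

# Example: the second pool point is far more sensitive to perturbation,
# so it would be queried first.
p_orig = np.array([[0.70, 0.30], [0.60, 0.40]])
p_aug  = np.array([[0.68, 0.32], [0.20, 0.80]])
print(local_sensitivity(p_orig, p_aug))
```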
arXiv Detail & Related papers (2022-05-10T15:39:11Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry, pseudo-labeling readily available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
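A hedged PyTorch sketch of that decoupling as described in the summary (not the authors' code): one head is trained only on clean labeled data and is the sole producer of pseudo labels, while a second head consumes those pseudo labels on unlabeled data; both share a backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_generate = nn.Linear(hidden, num_classes)  # produces pseudo labels
        self.head_utilize = nn.Linear(hidden, num_classes)   # trained on pseudo labels

    def forward(self, x):
        z = self.backbone(x)
        return self.head_generate(z), self.head_utilize(z)

model = TwoHeadModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_lab, y_lab = torch.randn(16, 32), torch.randint(0, 2, (16,))
x_unlab = torch.randn(64, 32)

for _ in range(10):
    logits_gen_lab, _ = model(x_lab)            # generating head sees only clean labels
    with torch.no_grad():                        # pseudo-label generation is detached
        pseudo = model(x_unlab)[0].argmax(dim=1)
    _, logits_use_unlab = model(x_unlab)         # utilizing head absorbs pseudo-label noise
    loss = F.cross_entropy(logits_gen_lab, y_lab) + F.cross_entropy(logits_use_unlab, pseudo)
    opt.zero_grad()
    loss.backward()
    opt.step()
```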
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels [49.990938653249415]
This research presents a methodology that assigns initial pseudo-labels to unlabeled data, treats the result as noisy-labeled data, and trains a deep neural network on that noisy-labeled data.
Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
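To make the setup concrete, the sketch below assigns an initial "negative" pseudo-label to every unlabeled example and fits a plain classifier on the resulting noisy-labeled set; the paper instead trains a deep network with noisy-label techniques, so this is only a stand-in for the first step, and all data and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_pos = rng.normal(loc=2.0, size=(100, 5))       # known positives
x_unlab = rng.normal(loc=0.0, size=(900, 5))     # unlabeled mix of both classes

X = np.vstack([x_pos, x_unlab])
y_noisy = np.concatenate([np.ones(len(x_pos)), np.zeros(len(x_unlab))])  # pseudo-negatives

clf = LogisticRegression(max_iter=1000).fit(X, y_noisy)
# Scores on the unlabeled pool can then be used to refine the pseudo-labels
# (e.g. promote confident positives) before training the final model.
scores = clf.predict_proba(x_unlab)[:, 1]
```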
arXiv Detail & Related papers (2021-03-08T11:46:02Z)
- Cost-Based Budget Active Learning for Deep Learning [0.9732863739456035]
We propose Cost-Based Budget Active Learning (CBAL), which considers classification uncertainty as well as instance diversity in a population constrained by a budget.
A principled min-max approach is used to minimize both the labeling and the decision cost of the selected instances.
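The min-max formulation itself is not reproduced here; as a stand-in, the greedy sketch below scores each candidate by uncertainty plus a diversity bonus and selects until a labeling-cost budget is exhausted, which conveys the budgeted trade-off the summary describes. All names and constants are illustrative assumptions.

```python
import numpy as np

def greedy_budget_selection(uncertainty, emb, cost, budget, lam=1.0):
    """uncertainty: (n,), emb: (n, d), cost: (n,); returns selected indices."""
    selected, spent = [], 0.0
    remaining = set(range(len(uncertainty)))
    while remaining:
        best, best_score = None, -np.inf
        for i in remaining:
            if spent + cost[i] > budget:
                continue
            # diversity bonus: distance to the closest already-selected point
            div = (np.min(np.linalg.norm(emb[i] - emb[selected], axis=1))
                   if selected else 0.0)
            score = uncertainty[i] + lam * div
            if score > best_score:
                best, best_score = i, score
        if best is None:            # nothing else fits within the budget
            break
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected

# Example: 200 candidates with varying labeling costs and a budget of 10.
rng = np.random.default_rng(0)
idx = greedy_budget_selection(rng.uniform(size=200), rng.normal(size=(200, 4)),
                              rng.uniform(0.5, 1.5, size=200), budget=10.0)
```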
arXiv Detail & Related papers (2020-12-09T17:42:44Z)
- Semi-supervised Batch Active Learning via Bilevel Optimization [89.37476066973336]
We formulate our approach as a data summarization problem via bilevel optimization.
We show that our method is highly effective for keyword detection tasks in the regime where only a few labeled samples are available.
arXiv Detail & Related papers (2020-10-19T16:53:24Z)
- Toward Optimal Probabilistic Active Learning Using a Bayesian Approach [4.380488084997317]
Active learning aims to reduce labeling costs through an efficient and effective allocation of costly labeling resources.
By reformulating existing selection strategies within our proposed model, we can explain which aspects are not covered by the current state of the art.
arXiv Detail & Related papers (2020-06-02T15:59:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.