Is margin all you need? An extensive empirical study of active learning on tabular data
- URL: http://arxiv.org/abs/2210.03822v1
- Date: Fri, 7 Oct 2022 21:18:24 GMT
- Title: Is margin all you need? An extensive empirical study of active learning on tabular data
- Authors: Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh
- Abstract summary: We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state-of-the-art.
- Score: 66.18464006872345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a labeled training set and a collection of unlabeled data, the goal of
active learning (AL) is to identify the best unlabeled points to label. In this
comprehensive study, we analyze the performance of a variety of AL algorithms
on deep neural networks trained on 69 real-world tabular classification
datasets from the OpenML-CC18 benchmark. We consider different data regimes and
the effect of self-supervised model pre-training. Surprisingly, we find that
the classical margin sampling technique matches or outperforms all others,
including the current state-of-the-art, in a wide range of experimental settings. We
hope to encourage researchers to benchmark rigorously against margin, and to
persuade practitioners facing tabular data labeling constraints that
hyperparameter-free margin may often be all they need.
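For context, margin sampling scores each unlabeled point by the gap between its
two largest predicted class probabilities and queries the points with the
smallest gap. A minimal NumPy sketch of this selection rule (function and
variable names are illustrative, not taken from the paper):

```python
import numpy as np

def margin_select(probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k unlabeled points with the smallest top-two probability gap.

    probs: (n_unlabeled, n_classes) predicted class probabilities.
    Returns the indices of the k points to send for labeling.
    """
    sorted_probs = np.sort(probs, axis=1)                 # ascending per row
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]   # p(top1) - p(top2)
    # The smallest margins mark the most ambiguous points.
    return np.argsort(margins)[:k]
```

Beyond the query batch size k, the rule has nothing to tune, which is what the
abstract means by hyperparameter-free.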
Related papers
- Inconsistency Masks: Removing the Uncertainty from Input-Pseudo-Label Pairs [0.0]
Inconsistency Masks (IM) is a novel approach that filters uncertainty in image-pseudo-label pairs to substantially enhance segmentation quality.
We achieve strong segmentation results with as little as 10% labeled data across four diverse datasets.
Three of our hybrid approaches even outperform models trained on the fully labeled dataset.
arXiv Detail & Related papers (2024-01-25T18:46:35Z)
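The Inconsistency Masks summary above does not spell out the filtering rule.
Purely as an illustrative sketch of the general idea, one could mask out pixels
whose pseudo-labels are inconsistent between two models; every name below is
hypothetical, not the authors' code:

```python
import numpy as np

def consistency_filter(pred_a: np.ndarray, pred_b: np.ndarray,
                       ignore_index: int = 255) -> np.ndarray:
    """Hypothetical filter: keep pixels where two pseudo-label maps agree.

    pred_a, pred_b: (H, W) integer class maps from two models. Pixels where
    the maps disagree are set to ignore_index so a standard segmentation
    loss skips them during training.
    """
    pseudo = pred_a.copy()
    pseudo[pred_a != pred_b] = ignore_index
    return pseudo
```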
- Memory Consistency Guided Divide-and-Conquer Learning for Generalized Category Discovery [56.172872410834664]
Generalized category discovery (GCD) addresses a more realistic and challenging setting of semi-supervised learning.
We propose a Memory Consistency guided Divide-and-conquer Learning framework (MCDL).
Our method outperforms state-of-the-art models by a large margin on both seen and unseen classes in generic image recognition.
arXiv Detail & Related papers (2024-01-24T09:39:45Z)
- Learning from the Best: Active Learning for Wireless Communications [9.523381807291049]
Active learning algorithms identify the most critical and informative samples in an unlabeled dataset and label only those samples, instead of the complete set.
We present a case study of deep learning-based mmWave beam selection, where labeling is performed by a compute-intensive algorithm based on exhaustive search.
Our results show that using an active learning algorithm for class-imbalanced datasets can reduce labeling overhead by up to 50% for this dataset.
arXiv Detail & Related papers (2024-01-23T12:21:57Z)
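As context for the case study above, a generic pool-based active learning loop
looks like the following sketch; `oracle` stands in for the expensive labeler
(here, the exhaustive beam search), and all interfaces are assumptions rather
than the paper's code:

```python
def active_learning_loop(model, labeled, pool, oracle, rounds=10, batch=32):
    """Generic pool-based AL: train, score the pool, query labels, repeat.

    `model.fit` and `model.score` are assumed interfaces; `score` should
    return higher values for more informative points.
    """
    for _ in range(rounds):
        model.fit(labeled)                        # retrain on current labels
        scores = [model.score(x) for x in pool]   # informativeness per point
        ranked = sorted(range(len(pool)), key=lambda i: -scores[i])[:batch]
        for i in sorted(ranked, reverse=True):    # pop high indices first
            labeled.append((pool[i], oracle(pool[i])))
            pool.pop(i)
    return model
```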
- MyriadAL: Active Few Shot Learning for Histopathology [10.652626309100889]
We introduce an active few-shot learning framework, Myriad Active Learning (MAL).
MAL includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop.
Experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works.
arXiv Detail & Related papers (2023-10-24T20:08:15Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
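As a rough illustration of the open-word idea above (the exact sampling scheme
and prompt format used by M-Tuning are not specified in the summary; nltk's
WordNet interface and all names below are stand-ins):

```python
import random
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def extend_label_words(closed_set_labels, n_open=100, seed=0):
    """Hypothetical open-word extension: pad the closed-set label words
    with random WordNet nouns to simulate an open-set scenario."""
    nouns = {lemma for syn in wn.all_synsets(pos=wn.NOUN)
             for lemma in syn.lemma_names()}
    nouns -= set(closed_set_labels)          # don't duplicate real labels
    random.seed(seed)
    return list(closed_set_labels) + random.sample(sorted(nouns), n_open)
```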
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
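The summary above does not name the diversity sampler; k-center greedy
(farthest-point) selection is one standard diversity-based choice, sketched
here over the consistency-based embeddings the paper mentions (names are
illustrative):

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int) -> list:
    """Farthest-point (k-center greedy) selection: repeatedly pick the
    point farthest from everything chosen so far, yielding a diverse
    subset to send for labeling."""
    chosen = [0]                                    # arbitrary seed point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))                 # farthest from chosen set
        chosen.append(idx)
        new = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new)              # dist to nearest center
    return chosen
```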
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
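The decoupling described above, one head generating pseudo labels and an
independent head consuming them, might look roughly like the following
PyTorch-style sketch (shapes, names, and the confidence threshold are
assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def debiased_pseudo_step(backbone, head_gen, head_use, x_unlabeled,
                         threshold=0.95):
    """Sketch of decoupled pseudo-labeling: `head_gen` produces pseudo
    labels with no gradient, and only `head_use` is trained on them."""
    feats = backbone(x_unlabeled)
    with torch.no_grad():                    # generation head: no gradients
        probs = F.softmax(head_gen(feats), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf >= threshold             # keep confident predictions only
    return F.cross_entropy(head_use(feats[keep]), pseudo[keep])
```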
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
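The corruption step behind SCARF's views can be sketched as follows: with some
probability, each feature is replaced by a draw from that feature's empirical
marginal, taken here from the current batch; the 60% corruption rate below is
an assumed default, and the names are illustrative:

```python
import numpy as np

def scarf_corrupt(x: np.ndarray, batch: np.ndarray, rate: float = 0.6,
                  rng=None) -> np.ndarray:
    """Build a corrupted view for SCARF-style contrastive learning.

    With probability `rate`, each feature of `x` is replaced by the same
    feature from a random row of `batch`, i.e. a sample from that
    feature's empirical marginal distribution.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) < rate                 # features to corrupt
    donors = rng.integers(0, batch.shape[0], size=x.shape)
    replacements = batch[donors, np.arange(x.shape[-1])]
    return np.where(mask, replacements, x)
```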
- Rényi Entropy Bounds on the Active Learning Cost-Performance Tradeoff [27.436483977171328]
Semi-supervised classification studies how to combine the statistical knowledge of the often abundant unlabeled data with the often limited labeled data in order to maximize overall classification accuracy.
In this paper, we initiate the non-asymptotic analysis of the optimal policy for semi-supervised classification with actively obtained labeled data.
We provide the first characterization of the jointly optimal active learning and semi-supervised classification policy, in terms of the cost-performance tradeoff driven by the label query budget and overall classification accuracy.
arXiv Detail & Related papers (2020-02-05T22:38:35Z)
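For reference, the Rényi entropy of order \alpha in the title above is the
standard one-parameter generalization of Shannon entropy:

```latex
H_\alpha(X) = \frac{1}{1-\alpha}\,\log\!\left(\sum_{i=1}^{n} p_i^{\alpha}\right),
\qquad \alpha > 0,\ \alpha \neq 1,
```

which recovers the Shannon entropy in the limit \alpha \to 1.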
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.