Is margin all you need? An extensive empirical study of active learning on tabular data
- URL: http://arxiv.org/abs/2210.03822v1
- Date: Fri, 7 Oct 2022 21:18:24 GMT
- Title: Is margin all you need? An extensive empirical study of active learning on tabular data
- Authors: Dara Bahri, Heinrich Jiang, Tal Schuster, Afshin Rostamizadeh
- Abstract summary: We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state-of-the-art.
- Score: 66.18464006872345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a labeled training set and a collection of unlabeled data, the goal of
active learning (AL) is to identify the best unlabeled points to label. In this
comprehensive study, we analyze the performance of a variety of AL algorithms
on deep neural networks trained on 69 real-world tabular classification
datasets from the OpenML-CC18 benchmark. We consider different data regimes and
the effect of self-supervised model pre-training. Surprisingly, we find that
the classical margin sampling technique matches or outperforms all others,
including the current state-of-the-art, in a wide range of experimental settings. We
hope to encourage researchers to benchmark rigorously against margin, and to
persuade practitioners facing tabular data labeling constraints that
hyperparameter-free margin may often be all they need.
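For context, margin sampling scores each unlabeled point by the gap between its
two largest predicted class probabilities and queries the points with the
smallest gap. A minimal NumPy sketch of this selection rule (function and
variable names are illustrative, not taken from the paper):

```python
import numpy as np

def margin_select(probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k unlabeled points with the smallest top-two probability gap.

    probs: (n_unlabeled, n_classes) predicted class probabilities.
    Returns the indices of the k points to send for labeling.
    """
    sorted_probs = np.sort(probs, axis=1)                 # ascending per row
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]   # p(top1) - p(top2)
    # The smallest margins mark the most ambiguous points.
    return np.argsort(margins)[:k]
```

Beyond the query batch size k, the rule has nothing to tune, which is what the
abstract means by hyperparameter-free.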
Related papers
- Inconsistency Masks: Removing the Uncertainty from Input-Pseudo-Label Pairs [0.0]
Inconsistency Masks (IM) is a novel approach that filters uncertainty in image-pseudo-label pairs to substantially enhance segmentation quality.
We achieve strong segmentation results with as little as 10% labeled data across four diverse datasets.
Three of our hybrid approaches even outperform models trained on the fully labeled dataset.
arXiv Detail & Related papers (2024-01-25T18:46:35Z)
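The Inconsistency Masks summary above does not spell out the filtering rule.
Purely as an illustrative sketch of the general idea, one could mask out pixels
whose pseudo-labels are inconsistent between two models; every name below is
hypothetical, not the authors' code:

```python
import numpy as np

def consistency_filter(pred_a: np.ndarray, pred_b: np.ndarray,
                       ignore_index: int = 255) -> np.ndarray:
    """Hypothetical filter: keep pixels where two pseudo-label maps agree.

    pred_a, pred_b: (H, W) integer class maps from two models. Pixels where
    the maps disagree are set to ignore_index so a standard segmentation
    loss skips them during training.
    """
    pseudo = pred_a.copy()
    pseudo[pred_a != pred_b] = ignore_index
    return pseudo
```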
- Memory Consistency Guided Divide-and-Conquer Learning for Generalized Category Discovery [56.172872410834664]
Generalized category discovery (GCD) addresses a more realistic and challenging setting of semi-supervised learning.
We propose a Memory Consistency guided Divide-and-conquer Learning framework (MCDL).
Our method outperforms state-of-the-art models by a large margin on both seen and unseen classes in generic image recognition.
arXiv Detail & Related papers (2024-01-24T09:39:45Z)
- Learning from the Best: Active Learning for Wireless Communications [9.523381807291049]
Active learning algorithms identify the most critical and informative samples in an unlabeled dataset and label only those samples, instead of the complete set.
We present a case study of deep learning-based mmWave beam selection, where labeling is performed by a compute-intensive algorithm based on exhaustive search.
Our results show that using an active learning algorithm for class-imbalanced datasets can reduce labeling overhead by up to 50% for this dataset.
arXiv Detail & Related papers (2024-01-23T12:21:57Z)
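As context for the case study above, a generic pool-based active learning loop
looks like the following sketch; `oracle` stands in for the expensive labeler
(here, the exhaustive beam search), and all interfaces are assumptions rather
than the paper's code:

```python
def active_learning_loop(model, labeled, pool, oracle, rounds=10, batch=32):
    """Generic pool-based AL: train, score the pool, query labels, repeat.

    `model.fit` and `model.score` are assumed interfaces; `score` should
    return higher values for more informative points.
    """
    for _ in range(rounds):
        model.fit(labeled)                        # retrain on current labels
        scores = [model.score(x) for x in pool]   # informativeness per point
        ranked = sorted(range(len(pool)), key=lambda i: -scores[i])[:batch]
        for i in sorted(ranked, reverse=True):    # pop high indices first
            labeled.append((pool[i], oracle(pool[i])))
            pool.pop(i)
    return model
```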
- MyriadAL: Active Few Shot Learning for Histopathology [10.652626309100889]
We introduce an active few-shot learning framework, Myriad Active Learning (MAL).
MAL includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop.
Experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works.
arXiv Detail & Related papers (2023-10-24T20:08:15Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
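As a rough illustration of the open-word idea above (the exact sampling scheme
and prompt format used by M-Tuning are not specified in the summary; nltk's
WordNet interface and all names below are stand-ins):

```python
import random
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def extend_label_words(closed_set_labels, n_open=100, seed=0):
    """Hypothetical open-word extension: pad the closed-set label words
    with random WordNet nouns to simulate an open-set scenario."""
    nouns = {lemma for syn in wn.all_synsets(pos=wn.NOUN)
             for lemma in syn.lemma_names()}
    nouns -= set(closed_set_labels)          # don't duplicate real labels
    random.seed(seed)
    return list(closed_set_labels) + random.sample(sorted(nouns), n_open)
```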
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
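The summary above does not name the diversity sampler; k-center greedy
(farthest-point) selection is one standard diversity-based choice, sketched
here over the consistency-based embeddings the paper mentions (names are
illustrative):

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int) -> list:
    """Farthest-point (k-center greedy) selection: repeatedly pick the
    point farthest from everything chosen so far, yielding a diverse
    subset to send for labeling."""
    chosen = [0]                                    # arbitrary seed point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))                 # farthest from chosen set
        chosen.append(idx)
        new = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new)              # dist to nearest center
    return chosen
```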
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
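The decoupling described above, one head generating pseudo labels and an
independent head consuming them, might look roughly like the following
PyTorch-style sketch (shapes, names, and the confidence threshold are
assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def debiased_pseudo_step(backbone, head_gen, head_use, x_unlabeled,
                         threshold=0.95):
    """Sketch of decoupled pseudo-labeling: `head_gen` produces pseudo
    labels with no gradient, and only `head_use` is trained on them."""
    feats = backbone(x_unlabeled)
    with torch.no_grad():                    # generation head: no gradients
        probs = F.softmax(head_gen(feats), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf >= threshold             # keep confident predictions only
    return F.cross_entropy(head_use(feats[keep]), pseudo[keep])
```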
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
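The corruption step behind SCARF's views can be sketched as follows: with some
probability, each feature is replaced by a draw from that feature's empirical
marginal, taken here from the current batch; the 60% corruption rate below is
an assumed default, and the names are illustrative:

```python
import numpy as np

def scarf_corrupt(x: np.ndarray, batch: np.ndarray, rate: float = 0.6,
                  rng=None) -> np.ndarray:
    """Build a corrupted view for SCARF-style contrastive learning.

    With probability `rate`, each feature of `x` is replaced by the same
    feature from a random row of `batch`, i.e. a sample from that
    feature's empirical marginal distribution.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) < rate                 # features to corrupt
    donors = rng.integers(0, batch.shape[0], size=x.shape)
    replacements = batch[donors, np.arange(x.shape[-1])]
    return np.where(mask, replacements, x)
```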
- Rényi Entropy Bounds on the Active Learning Cost-Performance Tradeoff [27.436483977171328]
Semi-supervised classification studies how to combine the statistical knowledge of the often abundant unlabeled data with the often limited labeled data in order to maximize overall classification accuracy.
In this paper, we initiate the non-asymptotic analysis of the optimal policy for semi-supervised classification with actively obtained labeled data.
We provide the first characterization of the jointly optimal active learning and semi-supervised classification policy, in terms of the cost-performance tradeoff driven by the label query budget and overall classification accuracy.
arXiv Detail & Related papers (2020-02-05T22:38:35Z)
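For reference, the Rényi entropy of order \alpha in the title above is the
standard one-parameter generalization of Shannon entropy:

```latex
H_\alpha(X) = \frac{1}{1-\alpha}\,\log\!\left(\sum_{i=1}^{n} p_i^{\alpha}\right),
\qquad \alpha > 0,\ \alpha \neq 1,
```

which recovers the Shannon entropy in the limit \alpha \to 1.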
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.