On the Limitations of Simulating Active Learning
- URL: http://arxiv.org/abs/2305.13342v1
- Date: Sun, 21 May 2023 22:52:13 GMT
- Title: On the Limitations of Simulating Active Learning
- Authors: Katerina Margatina and Nikolaos Aletras
- Abstract summary: Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation.
An easy fix to this impediment is to simulate AL, by treating an already labeled and publicly available dataset as the pool of unlabeled data.
We argue that evaluating AL algorithms on available labeled datasets might provide only a lower bound on their effectiveness on real data.
- Score: 32.34440406689871
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active learning (AL) is a human-and-model-in-the-loop paradigm that
iteratively selects informative unlabeled data for human annotation, aiming to
improve over random sampling. However, performing AL experiments with human
annotations on-the-fly is a laborious and expensive process, thus unrealistic
for academic research. An easy fix to this impediment is to simulate AL, by
treating an already labeled and publicly available dataset as the pool of
unlabeled data. In this position paper, we first survey recent literature and
highlight the challenges across all different steps within the AL loop. We
further unveil neglected caveats in the experimental setup that can
significantly affect the quality of AL research. We continue with an
exploration of how the simulation setting can govern empirical findings,
arguing that it might be one of the answers behind the ever-posed question
"why do active learning algorithms sometimes fail to outperform random
sampling?". We argue that evaluating AL algorithms on available labeled
datasets might provide only a lower bound on their effectiveness on real
data. We
believe it is essential to collectively shape the best practices for AL
research, particularly as engineering advancements in LLMs push the research
focus towards data-driven approaches (e.g., data efficiency, alignment,
fairness). In light of this, we have developed guidelines for future work. Our
aim is to draw attention to these limitations within the community, in the hope
of finding ways to address them.
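To make the simulation setting concrete, below is a minimal sketch of the pool-based setup the paper critiques: a fully labeled dataset stands in for the unlabeled pool, and "annotation" merely reveals a gold label that already exists. The function names, the classifier, and the least-confidence acquisition criterion are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulated_al_loop(X, y, seed_size=100, batch_size=50, rounds=10, seed=0):
    """Simulated pool-based AL: the dataset is fully labeled, but labels
    are treated as hidden until a point is queried. Revealing y[i] stands
    in for sending example i to a human annotator."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=seed_size, replace=False))
    pool = sorted(set(range(len(X))) - set(labeled))
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        # Least-confidence acquisition: lowest top-class probability first.
        confidence = model.predict_proba(X[pool]).max(axis=1)
        queried = [pool[i] for i in np.argsort(confidence)[:batch_size]]
        labeled.extend(queried)  # "annotation" = revealing stored labels
        pool = [i for i in pool if i not in set(queried)]
    final = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    return final, labeled
```

In a real deployment the queried labels would come from human annotators on the fly; the paper's point is that this substitution is not free, since a simulation's pool was already curated, cleaned, and labeled once.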
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- MyriadAL: Active Few Shot Learning for Histopathology [10.652626309100889]
We introduce an active few-shot learning framework, Myriad Active Learning (MAL).
MAL includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop.
Experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works.
arXiv Detail & Related papers (2023-10-24T20:08:15Z)
- Improving and Benchmarking Offline Reinforcement Learning Algorithms [87.67996706673674]
This work aims to bridge the gaps caused by low-level choices and datasets.
We empirically investigate 20 implementation choices using three representative algorithms.
We find that two variants, CRR+ and CQL+, achieve a new state of the art on D4RL.
arXiv Detail & Related papers (2023-06-01T17:58:46Z)
- Navigating the Pitfalls of Active Learning Evaluation: A Systematic Framework for Meaningful Performance Assessment [3.3064235071867856]
Active Learning (AL) aims to reduce the labeling burden by interactively selecting the most informative samples from a pool of unlabeled data.
Some studies have questioned the effectiveness of AL compared to emerging paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL).
arXiv Detail & Related papers (2023-01-25T15:07:44Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art.
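The margin criterion discussed in this entry is simple to state; a minimal sketch follows (function names are ours), assuming only a classifier that outputs class probabilities:

```python
import numpy as np

def margin_scores(probs):
    """probs: (n_samples, n_classes) class probabilities from any model.
    The margin is the gap between the two most likely classes; a small
    margin means the model is torn between them."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def margin_select(probs, k):
    # Query the k pool points with the smallest margins first.
    return np.argsort(margin_scores(probs))[:k]
```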
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme that selects optimal fixed-size batches of samples from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z)
- A Comparative Survey of Deep Active Learning [76.04825433362709]
Active Learning (AL) is a set of techniques for reducing labeling cost by sequentially selecting data samples from a large unlabeled data pool for labeling.
Deep Learning (DL) is data-hungry, and the performance of DL models scales monotonically with more training data.
In recent years, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance while minimizing the expensive labeling cost.
arXiv Detail & Related papers (2022-03-25T05:17:24Z)
- Active Learning-Based Optimization of Scientific Experimental Design [1.9705094859539976]
Active learning (AL) is a machine learning approach that can achieve greater accuracy with fewer labeled training instances.
This article performs a retrospective study on a drug response dataset using the proposed AL scheme.
It shows that scientific experimental design, instead of being manually set, can be optimized by AL.
arXiv Detail & Related papers (2021-12-29T20:02:35Z)
- Active learning for reducing labeling effort in text classification tasks [3.8424737607413153]
Active learning (AL) is a paradigm that aims to reduce labeling effort by using only the data that the model deems most informative.
We present an empirical study that compares different uncertainty-based algorithms, using BERT$_{base}$ as the underlying classifier.
Our results show that uncertainty-based AL with BERT$_{base}$ outperforms random sampling of data.
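As a rough illustration of this entry's setup, the sketch below scores a pool of texts by predictive entropy with a BERT-base classifier. The checkpoint name and helper function are assumptions for illustration, using the Hugging Face transformers API; the paper's exact acquisition functions may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; in practice this would be a model fine-tuned
# on the labeled set collected so far.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model.eval()

@torch.no_grad()
def entropy_scores(texts, batch_size=32):
    """Predictive entropy of each pool text; higher = more uncertain."""
    scores = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
        probs = model(**enc).logits.softmax(dim=-1)
        scores.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1))
    return torch.cat(scores)  # query the highest-entropy texts first
```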
arXiv Detail & Related papers (2021-09-10T13:00:36Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based AL are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
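BALD, mentioned in this entry, scores a candidate by the mutual information between its predicted label and the model parameters. A minimal MC-dropout sketch of that score (our formulation, not the paper's code):

```python
import torch

def bald_scores(model, x, T=20):
    """BALD via MC dropout: mutual information between the predicted label
    and the model parameters, estimated from T stochastic forward passes.
    score = H[E_t p_t] - E_t H[p_t]; higher scores are queried first."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)                              # (N, C)
    entropy_of_mean = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    mean_of_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_of_entropy
```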
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- Towards Robust and Reproducible Active Learning Using Neural Networks [15.696979318409392]
Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data.
Recently proposed neural network-based AL methods help reduce annotation cost in domains where labeling data can be prohibitively expensive.
In this study, we demonstrate that different types of AL algorithms produce inconsistent gains over a random sampling baseline.
arXiv Detail & Related papers (2020-02-21T22:01:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.