Data Shapley Valuation for Efficient Batch Active Learning
- URL: http://arxiv.org/abs/2104.08312v1
- Date: Fri, 16 Apr 2021 18:53:42 GMT
- Title: Data Shapley Valuation for Efficient Batch Active Learning
- Authors: Amirata Ghorbani, James Zou, Andre Esteva
- Abstract summary: Active Data Shapley (ADS) is a filtering layer for batch active learning.
We show that ADS is particularly effective when the pool of unlabeled data exhibits real-world caveats.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotating the right set of data amongst all available data points is a key
challenge in many machine learning applications. Batch active learning is a
popular approach to address this, in which batches of unlabeled data points are
selected for annotation, after which the underlying learning algorithm is
updated. Increasingly large batches are particularly appealing in
settings where data can be annotated in parallel, and model training is
computationally expensive. A key challenge here is scale: typical active
learning methods rely on diversity techniques, which select a diverse set of
data points to annotate from an unlabeled pool, a step whose cost grows
quickly with the size of the pool. In this work, we introduce
Active Data Shapley (ADS) -- a filtering layer for batch active learning that
significantly increases the efficiency of active learning by pre-selecting,
using a linear-time computation, the highest-value points from an unlabeled
dataset. Using the notion of the Shapley value of data, our method estimates
the value of unlabeled data points with regard to the prediction task at hand.
We show that ADS is particularly effective when the pool of unlabeled data
exhibits real-world caveats: noise, heterogeneity, and domain shift. We run
experiments demonstrating that when ADS is used to pre-select the
highest-ranking portion of an unlabeled dataset, the efficiency of
state-of-the-art batch active learning methods increases by an average factor
of 6x, while preserving model performance.
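As background, the data Shapley value of a point i under a performance metric V is its average marginal contribution over subsets of the remaining data: phi_i = (1/n) * sum over S subset of D \ {i} of [V(S u {i}) - V(S)] / C(n-1, |S|). Computing this exactly requires exponentially many evaluations of V, which is why an efficient estimator is essential. The sketch below is one plausible reading of the pipeline the abstract describes: score every pool point with the closed-form KNN-Shapley estimator of Jia et al. (2019), using pseudo-labels from the current model in place of the unknown labels, keep the top-ranked fraction, and hand the filtered pool to any off-the-shelf batch acquisition method. The estimator choice, the pseudo-labeling step, the use of a small labeled validation set, and all function names here are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical ADS-style filtering layer (a sketch, not the paper's code).
# Step 1: value every unlabeled point with the closed-form KNN-Shapley
#         estimator of Jia et al. (2019), using model pseudo-labels.
# Step 2: keep only the top-ranked fraction and hand it to any standard
#         batch active learning acquisition (e.g., a diversity method).
import numpy as np

def knn_shapley(X_train, y_train, X_val, y_val, k=5):
    """Shapley values of the (pseudo-)labeled points for a KNN classifier;
    O(n log n) per validation point, hence near-linear overall."""
    n = len(X_train)
    values = np.zeros(n)
    for x, y in zip(X_val, y_val):
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))
        match = (y_train[order] == y).astype(float)
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n  # contribution of the farthest point
        for i in range(n - 2, -1, -1):  # backward recursion of Jia et al.
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        values[order] += s  # map rank-ordered scores back to original indices
    return values / len(X_val)

def ads_filter(model, X_pool, X_val, y_val, keep_frac=0.2, k=5):
    """Pre-select the highest-value points of the unlabeled pool."""
    pseudo = model.predict(X_pool)  # assumed sklearn-style predict()
    v = knn_shapley(X_pool, pseudo, X_val, y_val, k=k)
    n_keep = max(1, int(keep_frac * len(X_pool)))
    return np.argsort(v)[::-1][:n_keep]  # indices of the top-value points
```

Under these assumptions the filter costs roughly O(m * n log n) for m validation points and an n-point pool; the much smaller filtered pool is then passed to the batch selection method of choice, e.g., a core-set or other diversity-based selector.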
Related papers
- Language Model-Driven Data Pruning Enables Efficient Active Learning [6.816044132563518]
We introduce a plug-and-play unlabeled data pruning strategy, ActivePrune, to prune the unlabeled pool.
To enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method.
Experiments on translation, sentiment analysis, topic classification, and summarization tasks demonstrate that ActivePrune outperforms existing data pruning methods.
arXiv Detail & Related papers (2024-10-05T19:46:11Z) - Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Novel Batch Active Learning Approach and Its Application to Synthetic Aperture Radar Datasets [7.381841249558068]
Recent gains have been made using sequential active learning for synthetic aperture radar (SAR) data (arXiv:2204.00005).
We developed a novel, two-part approach for batch active learning: Dijkstra's Annulus Core-Set (DAC) for core-set generation and LocalMax for batch sampling.
The batch active learning process that combines DAC and LocalMax achieves accuracy nearly identical to sequential active learning while being more efficient by a factor proportional to the batch size.
arXiv Detail & Related papers (2023-07-19T23:25:21Z) - Active learning for data streams: a survey [0.48951183832371004]
Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream.
Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data.
This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time.
arXiv Detail & Related papers (2023-02-17T14:24:13Z) - Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the high cost of labeling by selecting the most important samples to label.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z) - ALLSH: Active Learning Guided by Local Sensitivity and Hardness [98.61023158378407]
We propose to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function.
Our method achieves consistent gains over the commonly used active learning strategies in various classification tasks.
arXiv Detail & Related papers (2022-05-10T15:39:11Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm appears to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - One-Round Active Learning [13.25385227263705]
One-round active learning aims to select a subset of unlabeled data points that achieve the highest utility after being labeled.
We propose DULO, a general framework for one-round active learning based on the notion of data utility functions.
Our results demonstrate that while existing active learning approaches can succeed with multiple rounds, DULO consistently performs better in the one-round setting.
arXiv Detail & Related papers (2021-04-23T23:59:50Z) - Semi-supervised Batch Active Learning via Bilevel Optimization [89.37476066973336]
We formulate our approach as a data summarization problem via bilevel optimization.
We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.
arXiv Detail & Related papers (2020-10-19T16:53:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.