Low Budget Active Learning via Wasserstein Distance: An Integer
Programming Approach
- URL: http://arxiv.org/abs/2106.02968v1
- Date: Sat, 5 Jun 2021 21:25:03 GMT
- Title: Low Budget Active Learning via Wasserstein Distance: An Integer
Programming Approach
- Authors: Rafid Mahmood, Sanja Fidler, Marc T. Law
- Abstract summary: Active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label.
We propose a new integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool.
Our strategy requires high-quality latent features which we obtain by unsupervised learning on the unlabeled pool.
- Score: 81.19737119343438
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Given restrictions on the availability of data, active learning is the
process of training a model with limited labeled data by selecting a core
subset of an unlabeled data pool to label. Although selecting the most useful
points for training is an optimization problem, the scale of deep learning data
sets forces most selection strategies to employ efficient heuristics. Instead,
we propose a new integer optimization problem for selecting a core set that
minimizes the discrete Wasserstein distance from the unlabeled pool. We
demonstrate that this problem can be tractably solved with a Generalized
Benders Decomposition algorithm. Our strategy requires high-quality latent
features which we obtain by unsupervised learning on the unlabeled pool.
Numerical results on several data sets show that our optimization approach is
competitive with baselines and particularly outperforms them in the low budget
regime where less than one percent of the data set is labeled.
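The abstract describes the selection problem but includes no code. As a rough illustration of the kind of integer program involved, the sketch below selects a core set that minimizes the discrete Wasserstein distance to the pool when the core-set weights are left free, which reduces to a facility-location-style mixed-integer program. It uses PuLP on a toy two-dimensional pool; the latent features, budget, and solver are placeholders, and the paper's Generalized Benders Decomposition algorithm, which makes the problem tractable at scale, is not implemented here.

# Toy sketch: pick a budget-sized core set minimizing the discrete Wasserstein
# distance to the unlabeled pool (with free core-set weights this reduces to a
# facility-location-style MIP). Illustrative only; the paper solves the full
# problem with Generalized Benders Decomposition, which is not implemented here.
import numpy as np
import pulp

rng = np.random.default_rng(0)
features = rng.normal(size=(30, 2))   # stand-in for unsupervised latent features
n, budget = len(features), 3

# Pairwise transport costs (Euclidean distances between latent features).
cost = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

prob = pulp.LpProblem("wasserstein_coreset", pulp.LpMinimize)
z = pulp.LpVariable.dicts("select", range(n), cat="Binary")   # core-set indicator
pi = pulp.LpVariable.dicts("flow", [(i, j) for i in range(n) for j in range(n)],
                           lowBound=0)                        # transport plan

# Objective: total transport cost from the pool to the selected points.
prob += pulp.lpSum(float(cost[i, j]) * pi[(i, j)]
                   for i in range(n) for j in range(n))

for i in range(n):
    # Each pool point carries mass 1/n and must be fully transported.
    prob += pulp.lpSum(pi[(i, j)] for j in range(n)) == 1.0 / n
    for j in range(n):
        # Mass may only flow to selected points.
        prob += pi[(i, j)] <= z[j]

prob += pulp.lpSum(z[j] for j in range(n)) == budget          # labeling budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
core_set = [j for j in range(n) if pulp.value(z[j]) > 0.5]
print("selected indices:", core_set)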
Related papers
- Language Model-Driven Data Pruning Enables Efficient Active Learning [6.816044132563518]
We introduce a plug-and-play unlabeled data pruning strategy, ActivePrune, to prune the unlabeled pool.
To enhance the diversity in the unlabeled pool, we propose a novel perplexity reweighting method.
Experiments on translation, sentiment analysis, topic classification, and summarization tasks demonstrate that ActivePrune outperforms existing data pruning methods.
arXiv Detail & Related papers (2024-10-05T19:46:11Z)
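The ActivePrune entry above mentions perplexity reweighting for diversity but gives no algorithmic detail. The sketch below is only one plausible reading of the idea, in which a precomputed perplexity score per example (assumed here, e.g. from a small language model) is binned into quantiles and examples are kept with probability inversely proportional to the size of their bin, so no perplexity range dominates the pruned pool. The binning and weighting scheme are our assumptions, not ActivePrune's published algorithm.

# Hypothetical sketch of perplexity-reweighted pruning: keep the unlabeled pool
# diverse by sampling so that no perplexity band dominates. The quantile binning
# and inverse-frequency weights are illustrative assumptions, not ActivePrune's
# actual algorithm.
import numpy as np

def perplexity_reweighted_prune(perplexities, keep, n_bins=10, seed=0):
    """perplexities: LM perplexity per unlabeled example (assumed precomputed)."""
    rng = np.random.default_rng(seed)
    # Assign each example to a perplexity quantile bin.
    edges = np.quantile(perplexities, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(perplexities, edges)
    # Weight each example inversely to the size of its bin, so rare
    # perplexity ranges are not pruned away.
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    weights = 1.0 / counts[bins]
    weights /= weights.sum()
    return rng.choice(len(perplexities), size=keep, replace=False, p=weights)

# Example: prune a pool of 10,000 examples down to 1,000.
ppl = np.random.default_rng(1).lognormal(mean=3.0, sigma=0.5, size=10_000)
kept = perplexity_reweighted_prune(ppl, keep=1_000)
print(kept[:10])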
- Optimal and Efficient Binary Questioning for Human-in-the-Loop Annotation [11.4375764457726]
This paper studies the neglected complementary problem of getting annotated data given a predictor.
For the simple binary classification setting, we present the spectrum ranging from optimal general solutions to practical efficient methods.
arXiv Detail & Related papers (2023-07-04T09:11:33Z)
- Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via unbiased policy gradient, without the trouble of implicit differentiation.
arXiv Detail & Related papers (2023-01-24T09:37:00Z)
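No implementation accompanies the bilevel coreset entry either. The toy sketch below illustrates the general mechanism of learning a per-sample Bernoulli inclusion probability with an unbiased policy gradient (REINFORCE), using closed-form ridge regression as a stand-in for the inner training problem. The inner model, reward, and hyperparameters are illustrative assumptions rather than the paper's bilevel formulation.

# Toy sketch of policy-gradient coreset selection: each sample gets a Bernoulli
# inclusion probability, the reward is the (negative) validation loss of a model
# fit on the sampled subset, and the probabilities are updated with REINFORCE.
# The ridge-regression inner solver and all hyperparameters are illustrative
# choices, not the paper's bilevel formulation.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_train = X_train @ w_true + 0.1 * rng.normal(size=200)
X_val = rng.normal(size=(50, 5))
y_val = X_val @ w_true + 0.1 * rng.normal(size=50)

def inner_fit(X, y, lam=1e-2):
    # Closed-form ridge regression as a cheap stand-in for the inner training problem.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta = np.zeros(len(X_train))            # logits of per-sample inclusion probabilities
lr, baseline = 0.5, None
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-theta))
    mask = rng.random(len(p)) < p         # sample a candidate coreset
    if not mask.any():
        continue
    w = inner_fit(X_train[mask], y_train[mask])
    reward = -np.mean((X_val @ w - y_val) ** 2)           # outer objective
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    # REINFORCE: the gradient of log P(mask) w.r.t. the logits is (mask - p).
    theta += lr * (reward - baseline) * (mask.astype(float) - p)

coreset = np.argsort(-theta)[:20]         # 20 most preferred samples
print("selected indices:", coreset)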
- Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme that selects an optimal fixed-size batch of samples from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z)
- A Simple Baseline for Low-Budget Active Learning [15.54250249254414]
We show that a simple k-means clustering algorithm can outperform state-of-the-art active learning methods on low budgets.
This method can be used as a simple baseline for low-budget active learning on image classification.
arXiv Detail & Related papers (2021-10-22T19:36:56Z)
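The k-means baseline above is simple enough to sketch directly: cluster the unsupervised features into as many clusters as the labeling budget and label the sample nearest each centroid. The feature array and budget below are placeholders, and the paper should be consulted for the exact variant it evaluates.

# Sketch of a k-means baseline for low-budget active learning: cluster the
# (unsupervised) features into `budget` clusters and label the sample closest
# to each centroid. Placeholder features; see the paper for the exact variant.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_select(features, budget, seed=0):
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(features)
    # Distance of every sample to its own cluster centroid.
    dists = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        chosen.append(members[np.argmin(dists[members])])
    return np.array(chosen)

features = np.random.default_rng(0).normal(size=(5000, 128))  # e.g. self-supervised embeddings
to_label = kmeans_select(features, budget=50)
print(to_label[:10])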
- Partial Wasserstein Covering [10.52782170493037]
We consider a general task called partial Wasserstein covering with the goal of emulating a large dataset.
We model this problem as a discrete optimization problem with partial Wasserstein divergence as an objective function.
We show that we can efficiently make two datasets similar in terms of partial Wasserstein divergence, including on driving-scene datasets.
arXiv Detail & Related papers (2021-06-02T01:48:41Z)
- Semi-supervised Batch Active Learning via Bilevel Optimization [89.37476066973336]
We formulate our approach as a data summarization problem via bilevel optimization.
We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.
arXiv Detail & Related papers (2020-10-19T16:53:24Z)
- Semi-Supervised Learning with Meta-Gradient [123.26748223837802]
We propose a simple yet effective meta-learning algorithm in semi-supervised learning.
We find that the proposed algorithm performs favorably against state-of-the-art methods.
arXiv Detail & Related papers (2020-07-08T08:48:56Z)
- Gradient Descent in RKHS with Importance Labeling [58.79085525115987]
We study the importance labeling problem, in which we are given a large amount of unlabeled data.
We propose a new importance labeling scheme that can effectively select an informative subset of unlabeled data.
arXiv Detail & Related papers (2020-06-19T01:55:00Z)
- Progressive Identification of True Labels for Partial-Label Learning [112.94467491335611]
Partial-label learning (PLL) is a typical weakly supervised learning problem, where each training instance is equipped with a set of candidate labels among which only one is the true label.
Most existing methods are elaborately designed as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling to big data.
This paper proposes a novel classifier framework that is flexible in the choice of model and optimization algorithm.
arXiv Detail & Related papers (2020-02-19T08:35:15Z)
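For the partial-label entry above, the sketch below illustrates the general progressive-identification idea: keep a weight over each instance's candidate labels, train on the weighted cross-entropy, and re-normalize the weights from the model's own confidence restricted to the candidate set. The tiny linear softmax classifier and the update schedule are simplifications of ours, not necessarily the paper's exact algorithm.

# Sketch of progressive identification for partial-label learning: maintain a
# weight over each instance's candidate labels, train on the weighted
# cross-entropy, and refresh the weights from the model's own confidence
# restricted to the candidate set. Simplified; not necessarily the paper's
# exact algorithm.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_pll(X, candidates, n_classes, epochs=100, lr=0.1):
    """candidates: (n, n_classes) 0/1 mask of candidate labels per instance."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    weights = candidates / candidates.sum(axis=1, keepdims=True)   # uniform start
    for _ in range(epochs):
        probs = softmax(X @ W)
        # Gradient of the weighted cross-entropy for a linear softmax model.
        grad = X.T @ (probs - weights) / n
        W -= lr * grad
        # Progressively sharpen the weights toward the labels the model believes
        # in, never leaving the candidate set.
        weights = probs * candidates
        weights /= weights.sum(axis=1, keepdims=True)
    return W

# Tiny synthetic example: 3 classes, each instance keeps its true label plus noise labels.
rng = np.random.default_rng(0)
means = 3.0 * rng.normal(size=(3, 5))
true = rng.integers(0, 3, size=300)
X = means[true] + rng.normal(size=(300, 5))
cand = (np.eye(3, dtype=bool)[true] | (rng.random((300, 3)) < 0.3)).astype(float)
W = train_pll(X, cand, n_classes=3)
accuracy = (softmax(X @ W).argmax(axis=1) == true).mean()
print(f"train accuracy: {accuracy:.2f}")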
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.