Active clustering for labeling training data
- URL: http://arxiv.org/abs/2110.14521v1
- Date: Wed, 27 Oct 2021 15:35:58 GMT
- Title: Active clustering for labeling training data
- Authors: Quentin Lutz, Élie de Panafieu, Alex Scott, Maya Stein
- Abstract summary: We propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries.
We characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity.
- Score: 0.8029049649310211
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Gathering training data is a key step of any supervised learning task, and it
is both critical and expensive. Critical, because the quantity and quality of
the training data has a high impact on the performance of the learned function.
Expensive, because most practical cases rely on humans-in-the-loop to label the
data. The process of determining the correct labels is much more expensive than
comparing two items to see whether they belong to the same class. Thus
motivated, we propose a setting for training data gathering where the human
experts perform the comparatively cheap task of answering pairwise queries, and
the computer groups the items into classes (which can be labeled cheaply at the
very end of the process). Given the items, we consider two random models for
the classes: one where the set partition they form is drawn uniformly, the
other one where each item chooses its class independently following a fixed
distribution. In the first model, we characterize the algorithms that minimize
the average number of queries required to cluster the items and analyze their
complexity. In the second model, we analyze a specific algorithm family,
conjecture that it achieves the minimum average number of queries, and compare
its performance to a random approach. We also propose solutions to handle
errors or inconsistencies in the experts' answers.
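The query-based setting described in the abstract can be sketched with a naive (non-optimal) strategy: compare each new item against one representative per existing cluster, creating a new cluster when no comparison succeeds. This is an illustrative baseline only, not the paper's query-optimal algorithm; the `same_class` oracle stands in for the human expert, and the parity-based labels below are a made-up example.

```python
def cluster_by_pairwise_queries(items, same_class):
    """Group `items` into classes using pairwise queries to `same_class`.

    `same_class(a, b)` models the expert answering "do a and b belong
    to the same class?". Returns (clusters, number_of_queries).
    """
    clusters = []  # each cluster is a list; its first element is the representative
    queries = 0
    for item in items:
        placed = False
        for cluster in clusters:
            queries += 1  # one expert query per existing cluster, worst case
            if same_class(item, cluster[0]):
                cluster.append(item)
                placed = True
                break
        if not placed:
            clusters.append([item])  # no match: item starts a new class
    return clusters, queries


# Usage: the true classes are hidden behind the oracle; here parity plays that role.
clusters, n_queries = cluster_by_pairwise_queries(
    list(range(6)), lambda a, b: (a % 2) == (b % 2)
)
# clusters == [[0, 2, 4], [1, 3, 5]], using 7 queries in total
```

A query-optimal strategy would instead order the comparisons to exploit the prior over partitions (the two random models analyzed in the paper); this baseline only fixes the interface the paper studies: the expert answers same-class queries, and the machine does the grouping.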
Related papers
- Generating collective counterfactual explanations in score-based classification via mathematical optimization [4.281723404774889]
  A counterfactual explanation of an instance indicates how this instance should be minimally modified so that the perturbed instance is classified in the desired class.
  Most of the counterfactual analysis literature focuses on the single-instance, single-counterfactual setting.
  By means of novel mathematical optimization models, we provide a counterfactual explanation for each instance in a group of interest.
  arXiv Detail & Related papers (2023-10-19T15:18:42Z)
- Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
  We propose xOrder, a model-agnostic post-processing framework for achieving fairness in bipartite ranking.
  xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics.
  We evaluate the proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
  arXiv Detail & Related papers (2023-07-27T07:42:44Z)
- Cost-Effective Online Contextual Model Selection [14.094350329970537]
  We formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context.
  The goal is to output the best model for any given context without obtaining an excessive number of labels.
  We propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection.
  arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Unsupervised Crowdsourcing with Accuracy and Cost Guarantees [4.008789789191313]
  We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items.
  We propose algorithms for acquiring label predictions from workers and for inferring the true labels of items.
  arXiv Detail & Related papers (2022-07-05T12:14:11Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
  Spurious correlations between input samples and the target labels can wrongly direct the neural network predictions.
  We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
  We show improvements in the minority groups' performance while preserving overall aggregate accuracy across groups.
  arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
  We show that the optimal number of data points per task depends on the budget, but converges to a unique constant value for large budgets.
  Our results suggest a simple and efficient procedure for data collection.
  arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
  Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
  There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
  We propose a model post-processing framework for balancing fairness and utility in the bipartite ranking scenario.
  arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
  We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
  We conduct extensive experiments on two public datasets and obtain significant improvements on both.
  arXiv Detail & Related papers (2020-02-18T06:29:01Z)
- Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
  We study the problem of clustering a set of items from binary user feedback.
  We devise an algorithm with a minimal cluster recovery error rate.
  For adaptive selection, we develop an algorithm inspired by the derivation of information-theoretic error lower bounds.
  arXiv Detail & Related papers (2019-10-14T09:18:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.