Active clustering for labeling training data
- URL: http://arxiv.org/abs/2110.14521v1
- Date: Wed, 27 Oct 2021 15:35:58 GMT
- Title: Active clustering for labeling training data
- Authors: Quentin Lutz, Élie de Panafieu, Alex Scott, Maya Stein
- Abstract summary: We propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries.
We characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity.
- Score: 0.8029049649310211
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Gathering training data is a key step of any supervised learning task, and it
is both critical and expensive. Critical, because the quantity and quality of
the training data has a high impact on the performance of the learned function.
Expensive, because most practical cases rely on humans-in-the-loop to label the
data. The process of determining the correct labels is much more expensive than
comparing two items to see whether they belong to the same class. Thus
motivated, we propose a setting for training data gathering where the human
experts perform the comparatively cheap task of answering pairwise queries, and
the computer groups the items into classes (which can be labeled cheaply at the
very end of the process). Given the items, we consider two random models for
the classes: one where the set partition they form is drawn uniformly, the
other one where each item chooses its class independently following a fixed
distribution. In the first model, we characterize the algorithms that minimize
the average number of queries required to cluster the items and analyze their
complexity. In the second model, we analyze a specific algorithm family,
conjecture that it achieves the minimum average number of queries, and compare
its performance to a random approach. We also propose solutions to handle
errors or inconsistencies in the experts' answers.
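The query-based setting described in the abstract can be sketched with a naive (non-optimal) strategy: compare each new item against one representative per existing cluster, creating a new cluster when no comparison succeeds. This is an illustrative baseline only, not the paper's query-optimal algorithm; the `same_class` oracle stands in for the human expert, and the parity-based labels below are a made-up example.

```python
def cluster_by_pairwise_queries(items, same_class):
    """Group `items` into classes using pairwise queries to `same_class`.

    `same_class(a, b)` models the expert answering "do a and b belong
    to the same class?". Returns (clusters, number_of_queries).
    """
    clusters = []  # each cluster is a list; its first element is the representative
    queries = 0
    for item in items:
        placed = False
        for cluster in clusters:
            queries += 1  # one expert query per existing cluster, worst case
            if same_class(item, cluster[0]):
                cluster.append(item)
                placed = True
                break
        if not placed:
            clusters.append([item])  # no match: item starts a new class
    return clusters, queries


# Usage: the true classes are hidden behind the oracle; here parity plays that role.
clusters, n_queries = cluster_by_pairwise_queries(
    list(range(6)), lambda a, b: (a % 2) == (b % 2)
)
# clusters == [[0, 2, 4], [1, 3, 5]], using 7 queries in total
```

A query-optimal strategy would instead order the comparisons to exploit the prior over partitions (the two random models analyzed in the paper); this baseline only fixes the interface the paper studies: the expert answers same-class queries, and the machine does the grouping.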
Related papers
- Generating collective counterfactual explanations in score-based classification via mathematical optimization [4.281723404774889]
  A counterfactual explanation of an instance indicates how this instance should be minimally modified so that the perturbed instance is classified in the desired class.
  Most of the counterfactual analysis literature focuses on the single-instance, single-counterfactual setting.
  By means of novel mathematical optimization models, we provide a counterfactual explanation for each instance in a group of interest.
  arXiv Detail & Related papers (2023-10-19T15:18:42Z)
- Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
  We propose xOrder, a model-agnostic post-processing framework for achieving fairness in bipartite ranking.
  xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics.
  We evaluate the proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
  arXiv Detail & Related papers (2023-07-27T07:42:44Z)
- Cost-Effective Online Contextual Model Selection [14.094350329970537]
  We formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context.
  The goal is to output the best model for any given context without obtaining an excessive number of labels.
  We propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection.
  arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Unsupervised Crowdsourcing with Accuracy and Cost Guarantees [4.008789789191313]
  We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items.
  We propose algorithms for acquiring label predictions from workers and for inferring the true labels of items.
  arXiv Detail & Related papers (2022-07-05T12:14:11Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
  Spurious correlations between input samples and the target labels can wrongly direct the neural network predictions.
  We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
  We show improvements in the minority groups' performance while preserving overall aggregate accuracy across groups.
  arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
  We show that the optimal number of data points per task depends on the budget, but converges to a unique constant value for large budgets.
  Our results suggest a simple and efficient procedure for data collection.
  arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
  Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
  There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
  We propose a model post-processing framework for balancing fairness and utility in the bipartite ranking scenario.
  arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
  We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
  We conduct extensive experiments on two public datasets and obtain significant improvements on both.
  arXiv Detail & Related papers (2020-02-18T06:29:01Z)
- Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
  We study the problem of clustering a set of items from binary user feedback.
  We devise an algorithm with a minimal cluster recovery error rate.
  For adaptive selection, we develop an algorithm inspired by the derivation of information-theoretic error lower bounds.
  arXiv Detail & Related papers (2019-10-14T09:18:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.