Optimal Clustering from Noisy Binary Feedback
- URL: http://arxiv.org/abs/1910.06002v4
- Date: Mon, 5 Feb 2024 05:07:45 GMT
- Title: Optimal Clustering from Noisy Binary Feedback
- Authors: Kaito Ariu, Jungseul Ok, Alexandre Proutiere, Se-Young Yun
- Abstract summary: We study the problem of clustering a set of items from binary user feedback.
We devise an algorithm with a minimal cluster recovery error rate.
For adaptive selection, we develop an algorithm inspired by the derivation of the information-theoretical error lower bounds.
- Score: 75.17453757892152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of clustering a set of items from binary user feedback.
Such a problem arises in crowdsourcing platforms solving large-scale labeling
tasks with minimal effort put on the users. For example, in some of the recent
reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently
label images. In our inference problem, items are grouped into initially
unknown non-overlapping clusters. To recover these clusters, the learner
sequentially presents to users a finite list of items together with a question
with a binary answer selected from a fixed finite set. For each of these items,
the user provides a noisy answer whose expectation is determined by the item
cluster and the question and by an item-specific parameter characterizing the
{\it hardness} of classifying the item. The objective is to devise an algorithm
with a minimal cluster recovery error rate. We derive problem-specific
information-theoretical lower bounds on the error rate satisfied by any
algorithm, for both uniform and adaptive (list, question) selection strategies.
For uniform selection, we present a simple algorithm built upon the K-means
algorithm and whose performance almost matches the fundamental limits. For
adaptive selection, we develop an adaptive algorithm that is inspired by the
derivation of the information-theoretical error lower bounds, and in turn
allocates the budget in an efficient way. The algorithm learns to select items
hard to cluster and relevant questions more often. We compare the performance
of our algorithms with and without the adaptive selection strategy numerically
and illustrate the gain achieved by being adaptive.
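The feedback model in the abstract can be sketched as a small simulation: each item's answer to a question is a noisy binary draw whose expectation depends on the item's (unknown) cluster, the question, and an item-specific hardness that pushes answers toward 1/2. This is an illustrative sketch only; all names (`n_items`, `hardness`, the base probabilities `p`, the uniform budget `n_rounds`) are assumptions, and the plain K-means step is a crude stand-in for the paper's K-means-based algorithm, not its actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_clusters, n_questions = 60, 3, 4
true_cluster = rng.integers(0, n_clusters, size=n_items)

# Base answer probabilities p[k, q]: expected "yes" rate for an easy item
# of cluster k asked question q (hypothetical values).
p = rng.uniform(0.2, 0.8, size=(n_clusters, n_questions))

# Item-specific hardness: answers of harder items are shrunk toward 1/2,
# making their cluster harder to identify.
hardness = rng.uniform(0.5, 1.0, size=n_items)  # 1.0 = easy, 0.5 = hardest

# Uniform (list, question) selection: every pair gets the same budget.
n_rounds = 200
counts = np.zeros((n_items, n_questions))
for q in range(n_questions):
    for i in range(n_items):
        prob = 0.5 + hardness[i] * (p[true_cluster[i], q] - 0.5)
        counts[i, q] = rng.binomial(n_rounds, prob) / n_rounds

def kmeans(X, k, iters=50):
    """Plain Lloyd iterations on the empirical answer vectors."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

labels = kmeans(counts, n_clusters)
```

With enough budget, the empirical answer vectors of items in the same cluster concentrate around the same point, so even this generic K-means step separates the clusters; the hard items (hardness near 0.5) are exactly the ones an adaptive strategy would sample more often.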
Related papers
- Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization [52.80408805368928]
We introduce a novel greedy-style subset selection algorithm for batch acquisition.
Our experiments on the red fluorescent proteins show that our proposed method reaches the baseline performance with 1.69x fewer queries.
arXiv Detail & Related papers (2024-06-21T05:57:08Z) - FastGAS: Fast Graph-based Annotation Selection for In-Context Learning [53.17606395275021]
In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts.
Existing methods have proposed to select a subset of unlabeled examples for annotation.
We propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances.
arXiv Detail & Related papers (2024-06-06T04:05:54Z) - From Large to Small Datasets: Size Generalization for Clustering
Algorithm Selection [12.993073967843292]
We study a problem in a semi-supervised setting, with an unknown ground-truth clustering.
We introduce a notion of size generalization for clustering algorithm accuracy.
We use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
arXiv Detail & Related papers (2024-02-22T06:53:35Z) - Max-Min Diversification with Fairness Constraints: Exact and
Approximation Algorithms [17.57585822765145]
We propose an exact algorithm that is suitable for small datasets as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets.
Experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
arXiv Detail & Related papers (2023-01-05T13:02:35Z) - Ensemble Method for Cluster Number Determination and Algorithm Selection
in Unsupervised Learning [0.0]
Unsupervised learning requires domain expertise to be of practical use.
We propose an ensemble clustering framework which can be leveraged with minimal input.
arXiv Detail & Related papers (2021-12-23T04:59:10Z) - Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise
Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Certifiably Polynomial Algorithm for Best Group Subset Selection [0.9667631210393929]
Best group subset selection aims to choose a small part of non-overlapping groups to achieve the best interpretability on the response variable.
We propose a group-splicing algorithm that iteratively detects effective groups and excludes the unhelpful ones.
We demonstrate the efficiency and accuracy of our proposal by comparing state-of-the-art algorithms on both synthetic and real-world datasets.
arXiv Detail & Related papers (2021-04-23T03:05:11Z) - Clustering with Penalty for Joint Occurrence of Objects: Computational
Aspects [0.0]
The method of Holý, Sokol and Černý clusters objects based on their incidence in a large number of given sets.
The idea is to minimize the occurrence of multiple objects from the same cluster in the same set.
In the current paper, we study computational aspects of the method.
arXiv Detail & Related papers (2021-02-02T10:39:27Z) - Extreme Algorithm Selection With Dyadic Feature Representation [78.13985819417974]
We propose the setting of extreme algorithm selection (XAS) where we consider fixed sets of thousands of candidate algorithms.
We assess the applicability of state-of-the-art AS techniques to the XAS setting and propose approaches leveraging a dyadic feature representation.
arXiv Detail & Related papers (2020-01-29T09:40:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.