Crowdsourcing Without People: Modelling Clustering Algorithms as Experts
- URL: http://arxiv.org/abs/2509.25395v1
- Date: Mon, 29 Sep 2025 18:52:37 GMT
- Title: Crowdsourcing Without People: Modelling Clustering Algorithms as Experts
- Authors: Jordyn E. A. Lorentz, Katharine M. Clark,
- Abstract summary: mixsemble is an ensemble method that adapts the Dawid-Skene model to aggregate predictions from multiple model-based clustering algorithms. Unlike traditional crowdsourcing, which relies on human labels, the framework models the outputs of clustering algorithms as noisy annotations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces mixsemble, an ensemble method that adapts the Dawid-Skene model to aggregate predictions from multiple model-based clustering algorithms. Unlike traditional crowdsourcing, which relies on human labels, the framework models the outputs of clustering algorithms as noisy annotations. Experiments on both simulated and real-world datasets show that, although the mixsemble is not always the single top performer, it consistently approaches the best result and avoids poor outcomes. This robustness makes it a practical alternative when the true data structure is unknown, especially for non-expert users.
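The paper's exact model is not reproduced here, but the Dawid-Skene machinery it adapts is standard and easy to sketch. Below is a minimal illustration, assuming the cluster labels from the different algorithms have already been aligned to a common label space (e.g., by Hungarian matching against a reference partition); the function name, defaults, and toy data are illustrative, not from the paper.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, eps=1e-6):
    """Vanilla Dawid-Skene EM over integer labels of shape
    (n_items, n_annotators). Here each "annotator" is a clustering
    algorithm whose labels are assumed pre-aligned to a common label
    space. Returns per-item posterior probabilities over n_classes.
    Illustrative sketch only; not the mixsemble implementation."""
    n_items, n_annot = labels.shape

    # Initialise posteriors from vote counts (a soft majority vote).
    post = np.zeros((n_items, n_classes))
    for a in range(n_annot):
        post[np.arange(n_items), labels[:, a]] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and confusion matrices conf[a, true, observed].
        prior = post.mean(axis=0)
        conf = np.full((n_annot, n_classes, n_classes), eps)
        for a in range(n_annot):
            for obs in range(n_classes):
                conf[a, :, obs] += post[labels[:, a] == obs].sum(axis=0)
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over true labels given every algorithm's output.
        log_post = np.tile(np.log(prior + eps), (n_items, 1))
        for a in range(n_annot):
            log_post += np.log(conf[a][:, labels[:, a]]).T
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post

# Toy check: five noisy copies of a 3-class labelling.
rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=200)
views = np.stack([np.where(rng.random(200) < 0.8, truth,
                           rng.integers(0, 3, size=200))
                  for _ in range(5)], axis=1)
consensus = dawid_skene(views, n_classes=3).argmax(axis=1)
```

A useful byproduct of this style of aggregation is that the per-algorithm confusion matrices act as learned reliability estimates, which is what lets an ensemble of this kind discount clusterers that fit the data poorly.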
Related papers
- Hierarchical Clustering With Confidence [6.4793198569929356]
Agglomerative hierarchical clustering is highly sensitive to small perturbations in the data. We show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures.
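As a concrete illustration of the stability idea (a generic perturbation check, not the paper's randomization scheme or its hypothesis tests), one can rerun agglomerative clustering on lightly perturbed copies of the data and score agreement with the original partition:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def perturbation_stability(X, n_clusters, n_runs=20, noise_scale=0.05, seed=0):
    """Mean adjusted Rand index between the clustering of the original
    data and clusterings of lightly perturbed copies. Low values flag
    partitions that are artefacts of small fluctuations in the data.
    Illustrative sketch; not the paper's randomization scheme."""
    rng = np.random.default_rng(seed)
    base = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    scale = noise_scale * X.std(axis=0, keepdims=True)
    scores = []
    for _ in range(n_runs):
        Xp = X + rng.normal(0.0, scale, size=X.shape)
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(Xp)
        scores.append(adjusted_rand_score(base, labels))
    return float(np.mean(scores))
```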
arXiv Detail & Related papers (2025-12-06T18:18:20Z) - Robust Mixture Models for Algorithmic Fairness Under Latent Heterogeneity [8.425890077048374]
We introduce ROME, a framework that learns latent group structure from data while optimizing for worst-group performance. ROME significantly improves algorithmic fairness compared to standard methods while maintaining competitive average performance. Our method requires no predefined group labels, making it practical when sources of disparities are unknown or evolving.
arXiv Detail & Related papers (2025-09-22T07:03:33Z) - Two Is Better Than One: Aligned Representation Pairs for Anomaly Detection [56.57122939745213]
Anomaly detection focuses on identifying samples that deviate from the norm. Recent self-supervised methods have successfully learned such representations by employing prior knowledge about anomalies to create synthetic outliers during training. We address this limitation with our new approach Con$_2$, which leverages prior knowledge about symmetries in normal samples to observe the data in different contexts.
arXiv Detail & Related papers (2024-05-29T07:59:06Z) - Coupled Confusion Correction: Learning from Crowds with Sparse Annotations [43.94012824749425]
Confusion matrices learned by two models can each be corrected using data distilled from the other.
We cluster the "annotator groups" who share similar expertise so that their confusion matrices can be corrected together.
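The grouping step lends itself to a short sketch. The helper below clusters annotators by the similarity of their estimated confusion matrices using k-means; it is an illustrative stand-in, not the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_annotators(conf_mats, n_groups, seed=0):
    """Cluster annotators by their flattened confusion matrices so that
    annotators with similar expertise land in the same group and their
    matrices can be corrected jointly. Illustrative stand-in only.

    conf_mats: iterable of (K, K) confusion matrices, one per annotator.
    Returns an integer group id per annotator.
    """
    flat = np.stack([np.asarray(c).ravel() for c in conf_mats])
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    return km.fit_predict(flat)
```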
arXiv Detail & Related papers (2023-12-12T14:47:26Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
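The mixing step can be sketched as a mixup-style oversampler. The version below draws a single convex combination biased toward the minority point; the weighting scheme and the absence of the paper's iterative refinement are simplifying assumptions:

```python
import numpy as np

def mix_minority_majority(X_min, X_maj, n_new, alpha=0.7, seed=0):
    """Generate synthetic samples by convexly mixing a random minority
    point with a random majority point, with the weight biased toward
    the minority point so the synthetic sample stays on the minority
    side. Simplified sketch; the paper's scheme is iterative."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_maj), size=n_new)
    lam = rng.uniform(alpha, 1.0, size=(n_new, 1))  # weight on the minority point
    return lam * X_min[i] + (1.0 - lam) * X_maj[j]
```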
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily collapse to the degenerate solution in which the encoder maps all inputs to the same point and every sample lands in a single cluster.
We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments.
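The paper regularizes the hard assignments themselves; as a plainly swapped-in, simpler stand-in, the sketch below shows the standard anti-collapse term that penalizes low entropy of a batch's mean cluster usage:

```python
import numpy as np

def usage_entropy_penalty(soft_assign, eps=1e-8):
    """Anti-collapse term: the negative entropy of the batch's mean
    cluster-usage distribution. It is largest when every input is routed
    to a single cluster and smallest when usage is uniform, so adding it
    to the clustering loss discourages the collapsed solution.

    soft_assign: (batch, n_clusters) array with rows summing to 1.
    """
    usage = soft_assign.mean(axis=0)          # average assignment per cluster
    entropy = -np.sum(usage * np.log(usage + eps))
    return -entropy                           # minimised when usage is uniform
```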
arXiv Detail & Related papers (2023-03-29T08:23:26Z) - A parallelizable model-based approach for marginal and multivariate clustering [0.0]
This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering.
We tackle this issue by specifying a finite mixture model per margin that allows each margin to have a different number of clusters.
The proposed approach is computationally appealing as well as more tractable for moderate to high dimensions than a 'full' (joint) model-based clustering approach.
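A minimal sketch of the per-margin idea, assuming univariate Gaussian mixtures with BIC model selection (the mixture family and selection rule here are illustrative choices, not necessarily the paper's):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def marginal_mixtures(X, max_k=6, seed=0):
    """Fit a univariate Gaussian mixture to each margin (column) of X,
    choosing each margin's number of components by BIC, so different
    margins may end up with different numbers of clusters.
    Illustrative choices throughout; not the paper's exact model."""
    fits = []
    for j in range(X.shape[1]):
        col = X[:, j:j + 1]
        best = min(
            (GaussianMixture(n_components=k, random_state=seed).fit(col)
             for k in range(1, max_k + 1)),
            key=lambda gm: gm.bic(col),
        )
        fits.append(best)
    return fits  # fits[j].predict(X[:, j:j+1]) gives margin-j labels
```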
arXiv Detail & Related papers (2022-12-07T23:54:41Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - Personalized Federated Learning via Convex Clustering [72.15857783681658]
We propose a family of algorithms for personalized federated learning with locally convex user costs.
The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized.
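The penalized objective is easy to sketch. Below, an unsquared L2 fusion penalty on pairwise model differences stands in for the paper's generalized convex-clustering term; the exact penalty and the optimization method are assumptions:

```python
import numpy as np

def personalized_objective(W, local_losses, lam):
    """Convex-clustering-style objective for personalized FL: the sum of
    each user's local loss plus a fusion penalty on pairwise model
    differences that pulls similar users' models together. The unsquared
    L2 penalty (which encourages exact merging) is an assumed choice.

    W: (n_users, dim) stacked user models.
    local_losses: list of callables, local_losses[i](w_i) -> float.
    """
    loss = sum(f(w) for f, w in zip(local_losses, W))
    diffs = W[:, None, :] - W[None, :, :]          # all pairwise differences
    pair_norms = np.linalg.norm(diffs, axis=2)     # ||w_i - w_j|| matrix
    return loss + lam * np.triu(pair_norms, k=1).sum()
```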
arXiv Detail & Related papers (2022-02-01T19:25:31Z) - Multi-output Gaussian Processes for Uncertainty-aware Recommender Systems [3.908842679355254]
We introduce an efficient strategy for model training and inference, resulting in a model that scales to very large and sparse datasets.
Our model also provides meaningful uncertainty estimates that quantify the confidence of each prediction.
arXiv Detail & Related papers (2021-06-08T10:01:14Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Discrete-Valued Latent Preference Matrix Estimation with Graph Side Information [12.836994708337144]
We develop an algorithm that matches the optimal sample complexity.
Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance.
arXiv Detail & Related papers (2020-03-16T06:29:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.