UCSL : A Machine Learning Expectation-Maximization framework for
Unsupervised Clustering driven by Supervised Learning
- URL: http://arxiv.org/abs/2107.01988v1
- Date: Mon, 5 Jul 2021 12:55:13 GMT
- Title: UCSL : A Machine Learning Expectation-Maximization framework for
Unsupervised Clustering driven by Supervised Learning
- Authors: Robin Louiset and Pietro Gori and Benoit Dufumier and Josselin Houenou
and Antoine Grigis and Edouard Duchesnay
- Abstract summary: Subtype Discovery consists of finding interpretable and consistent sub-parts of a dataset, which are also relevant to a certain supervised task.
We propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning).
Our method is generic: it can integrate any clustering method and can be driven by both binary classification and regression.
- Score: 2.133032470368051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subtype Discovery consists of finding interpretable and consistent sub-parts
of a dataset, which are also relevant to a certain supervised task. From a
mathematical point of view, this can be defined as a clustering task driven by
supervised learning in order to uncover subgroups in line with the supervised
prediction. In this paper, we propose a general Expectation-Maximization
ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised
Learning). Our method is generic: it can integrate any clustering method and
can be driven by both binary classification and regression. We propose to
construct a non-linear model by merging multiple linear estimators, one per
cluster. Each hyperplane is estimated so that it correctly discriminates, or
predicts, only one cluster. We use SVC or Logistic Regression for
classification and SVR for regression. Furthermore, to perform cluster analysis
within a more suitable space, we also propose a dimension-reduction algorithm
that projects the data onto an orthonormal space relevant to the supervised
task. We analyze the robustness and generalization capability of our algorithm
using synthetic and experimental datasets. In particular, we validate its
ability to identify suitable and consistent sub-types by conducting a
psychiatric-disease cluster analysis with known ground-truth labels. The gain
of the proposed method over previous state-of-the-art techniques is about +1.9
points in terms of balanced accuracy. Finally, we make the code and examples
available in a scikit-learn-compatible Python package at
https://github.com/neurospin-projects/2021_rlouiset_ucsl
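As a rough illustration of the alternation the abstract describes, the sketch below rebuilds it from off-the-shelf scikit-learn pieces: an M-step fitting one linear estimator per cluster, a QR-based projection onto the span of the resulting hyperplane normals, and an E-step that re-clusters the positive samples in that projected space. The function names, the choice of logistic regression plus k-means, and the fixed iteration count are all assumptions made for this example; the authors' tested implementation is the package linked above.

```python
# Minimal sketch of a UCSL-style EM alternation for binary classification.
# Illustrative only; not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def _m_step(X, y, pos, neg, labels, n_clusters):
    """Fit one linear estimator per cluster, each trained to discriminate
    'its' positive cluster from all negative samples."""
    clfs = []
    for k in range(n_clusters):
        idx = np.concatenate([pos[labels == k], neg])
        clfs.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return clfs


def ucsl_sketch(X, y, n_clusters=2, n_iter=10, seed=0):
    pos = np.flatnonzero(y == 1)  # samples whose hidden subtypes we seek
    neg = np.flatnonzero(y == 0)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X[pos])  # initialization: k-means on positives
    for _ in range(n_iter):
        clfs = _m_step(X, y, pos, neg, labels, n_clusters)
        # Supervised dimension reduction: orthonormalize the hyperplane
        # normals with QR and project the data onto the space they span.
        W = np.stack([c.coef_.ravel() for c in clfs], axis=1)  # (d, K)
        Q, _ = np.linalg.qr(W)
        # E-step: re-cluster the positives in the projected space, so the
        # clustering is driven by the supervised directions.
        # NOTE: a robust version must handle clusters that become empty.
        labels = km.fit_predict(X[pos] @ Q)
    clfs = _m_step(X, y, pos, neg, labels, n_clusters)  # match final labels
    # The non-linear ensemble merges the linear estimators: a sample is
    # scored by its best cluster-specific hyperplane.
    scores = np.max(np.stack([c.decision_function(X) for c in clfs]), axis=0)
    return labels, (scores > 0).astype(int), clfs
```

Swapping KMeans for a Gaussian mixture, or the logistic regressions for SVC/SVR, yields the classification and regression variants the abstract mentions.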
Related papers
- Can an unsupervised clustering algorithm reproduce a categorization system? [1.0485739694839669]
We investigate whether unsupervised clustering can reproduce ground truth classes in a labeled dataset.
We show that success depends on feature selection and the chosen distance metric; the toy sketch below illustrates both effects.
arXiv Detail & Related papers (2024-08-19T18:27:14Z)
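The claim in the entry above is easy to probe on a toy dataset. In the hedged sketch below, the dataset and preprocessing variants are my choices, not the paper's experiments: k-means recovery of the iris labels, scored by adjusted Rand index, shifts noticeably when features are rescaled (which, under Euclidean k-means, amounts to changing the distance metric) or subsetted.

```python
# Toy check: how well k-means recovers ground-truth classes depends on
# feature preparation. Dataset and variants are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0)

variants = {
    "raw features": X,
    "standardized": StandardScaler().fit_transform(X),
    "petal features only": X[:, 2:],  # petal length/width only
}
for name, Xv in variants.items():
    ari = adjusted_rand_score(y, km.fit_predict(Xv))
    print(f"{name:>20}: ARI = {ari:.2f}")
```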
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
arXiv Detail & Related papers (2024-04-26T06:00:27Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Interpretable Deep Clustering for Tabular Data [7.972599673048582]
Clustering is a fundamental learning task widely used in data analysis.
We propose a new deep-learning framework that predicts interpretable cluster assignments at the instance and cluster levels.
We show that the proposed method can reliably predict cluster assignments in biological, text, image, and physics datasets.
arXiv Detail & Related papers (2023-06-07T21:08:09Z)
- A Generalized Framework for Predictive Clustering and Optimization [18.06697544912383]
Clustering is a powerful and extensively used data science tool.
In this article, we define a generalized optimization framework for predictive clustering.
We also present a joint optimization strategy that exploits mixed-integer linear programming (MILP) for global optimization; a generic toy MILP of this flavour is sketched below.
arXiv Detail & Related papers (2023-05-07T19:56:51Z)
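To make the MILP angle concrete, here is a generic toy formulation, using PuLP with its bundled CBC solver; the variables, costs, and constraints are illustrative assumptions, not the paper's model. It uses binary assignment variables, a linear assignment cost for fixed centers, and the kind of side constraint (a minimum cluster size) that MILP handles naturally.

```python
# Generic sketch of a clustering assignment subproblem written as an MILP.
# Illustrative formulation only, not the paper's actual model.
import numpy as np
import pulp

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))   # toy data
centers = X[:3]                # fixed candidate centers
cost = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, K)

n, K = cost.shape
prob = pulp.LpProblem("cluster_assignment", pulp.LpMinimize)
z = pulp.LpVariable.dicts("z", (range(n), range(K)), cat="Binary")

# Objective: total assignment cost, linear in the binary variables.
prob += pulp.lpSum(float(cost[i, k]) * z[i][k]
                   for i in range(n) for k in range(K))
# Each point belongs to exactly one cluster.
for i in range(n):
    prob += pulp.lpSum(z[i][k] for k in range(K)) == 1
# Side constraint MILPs make easy: every cluster keeps at least 5 points.
for k in range(K):
    prob += pulp.lpSum(z[i][k] for i in range(n)) >= 5

prob.solve(pulp.PULP_CBC_CMD(msg=0))
labels = [max(range(K), key=lambda k: z[i][k].value()) for i in range(n)]
```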
- Self-Supervised Class Incremental Learning [51.62542103481908]
Existing Class Incremental Learning (CIL) methods are based on a supervised classification framework that is sensitive to data labels.
When updated on new class data, they suffer from catastrophic forgetting: the model cannot clearly distinguish old-class data from new.
In this paper, we explore the performance of Self-Supervised representation learning in Class Incremental Learning (SSCIL) for the first time.
arXiv Detail & Related papers (2021-11-18T06:58:19Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps; a generic sketch of one such reparametrization follows below.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
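Reparametrizing categorical assignment variables usually means a continuous relaxation such as Gumbel-softmax; the sketch below shows that standard construction in plain NumPy. Whether TCC uses exactly this relaxation is not stated in the summary above, so treat it as a generic illustration.

```python
# Gumbel-softmax relaxation: a standard way to reparametrize categorical
# assignment variables so they admit end-to-end gradients. Whether TCC uses
# exactly this construction should be checked against the paper itself.
import numpy as np

def gumbel_softmax(logits, tau=0.5, seed=0):
    """Draw a differentiable (soft one-hot) sample from Categorical(logits)."""
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau                                # temperature tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))         # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

# Soft assignments of 4 samples over 3 clusters; argmax gives the hard view.
logits = np.log(np.tile(np.array([0.7, 0.2, 0.1]), (4, 1)))
soft_assign = gumbel_softmax(logits)      # rows sum to 1, nearly one-hot
hard_assign = soft_assign.argmax(axis=1)  # discrete labels for evaluation
```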
- Learning Self-Expression Metrics for Scalable and Inductive Subspace Clustering [5.587290026368626]
Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data.
We propose a novel metric learning approach to learn instead a subspace affinity function using a siamese neural network architecture.
Our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets.
arXiv Detail & Related papers (2020-09-27T15:40:12Z)
- Kernel learning approaches for summarising and combining posterior similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices; the toy sketch below makes this concrete.
arXiv Detail & Related papers (2020-09-27T14:16:14Z)
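The positive semi-definiteness observation above is easy to verify in code: a PSM is an average of co-clustering indicator matrices Z Z^T built from one-hot partition matrices Z, and each such term is PSD. A toy sketch with simulated MCMC draws (the sizes and random partitions are placeholders):

```python
# Toy posterior similarity matrix (PSM): entry (i, j) is the fraction of
# MCMC partitions in which samples i and j share a cluster. Written as an
# average of Z @ Z.T with one-hot Z, it is positive semi-definite by design.
import numpy as np

rng = np.random.default_rng(0)
n, K, S = 8, 3, 200                      # samples, clusters, MCMC draws
draws = rng.integers(0, K, size=(S, n))  # stand-in for MCMC cluster labels

psm = np.zeros((n, n))
for labels in draws:
    Z = np.eye(K)[labels]                # one-hot partition matrix (n, K)
    psm += Z @ Z.T                       # co-clustering indicator, PSD
psm /= S

# PSD check: eigenvalues are non-negative (up to numerical noise)...
assert np.linalg.eigvalsh(psm).min() > -1e-10
# ...so the PSM can serve directly as a precomputed kernel Gram matrix,
# e.g. in sklearn's SVC(kernel="precomputed") or kernel PCA.
```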
- Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and an adaptive-neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
- Robust Self-Supervised Convolutional Neural Network for Subspace Clustering and Classification [0.10152838128195464]
This paper proposes a robust formulation of the self-supervised convolutional subspace clustering network ($S^2$ConvSCN).
In a truly unsupervised training environment, Robust $S^2$ConvSCN outperforms its baseline version by a significant margin for both seen and unseen data on four well-known datasets.
arXiv Detail & Related papers (2020-04-03T16:07:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.