Hierarchical Qualitative Clustering: clustering mixed datasets with
critical qualitative information
- URL: http://arxiv.org/abs/2006.16701v3
- Date: Mon, 6 Jul 2020 11:07:34 GMT
- Title: Hierarchical Qualitative Clustering: clustering mixed datasets with
critical qualitative information
- Authors: Diogo Seca, João Mendes-Moreira, Tiago Mendes-Neves, Ricardo Sousa
- Abstract summary: We propose Hierarchical Qualitative Clustering (HQC), a novel method for clustering qualitative values based on hierarchical clustering and the Maximum Mean Discrepancy.
Using a mixed dataset provided by Spotify, we showcase how our method can be used for clustering music artists based on the quantitative features of thousands of songs.
In addition, using financial features of companies, we cluster company industries and discuss the implications for investment portfolios.
- Score: 0.2294014185517203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering can be used to extract insights from data or to verify some of the
assumptions held by the domain experts, namely data segmentation. In the
literature, few methods can be applied in clustering qualitative values using
the context associated with other variables present in the data, without losing
interpretability. Moreover, the metrics for calculating dissimilarity between
qualitative values often scale poorly for high dimensional mixed datasets.
In this study, we propose Hierarchical Qualitative Clustering (HQC), a novel
method for clustering qualitative values based on hierarchical clustering and
the Maximum Mean Discrepancy. HQC maintains the original interpretability of
the qualitative information present in the dataset. We apply HQC to two
datasets. Using a mixed dataset provided by
Spotify, we showcase how our method can be used for clustering music artists
based on the quantitative features of thousands of songs. In addition, using
financial features of companies, we cluster company industries and discuss the
implications for investment portfolio diversification.
Related papers
- Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios [46.164361878412656]
Cluster analysis plays a crucial role across numerous scientific and engineering domains. Despite the wealth of clustering methods proposed over the past decades, each method is typically designed for specific scenarios. In this paper, we propose depth-based local center clustering (DLCC), which makes use of a local version of data depth based on subsets of the data.
arXiv Detail & Related papers (2025-05-14T16:08:11Z)
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification [2.666791490663749]
Cross-Modality Clustering-based Self-Labeling (CMCSL) groups instances belonging to each modality in the deep feature space and then propagates known labels within the resulting clusters.
Experimental evaluation was conducted on 20 datasets derived from the MM-IMDb dataset.
arXiv Detail & Related papers (2024-08-05T15:43:56Z)
- A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data [0.0]
We present an information-theoretic method for clustering mixed-type data.
The proposed approach is built on the deterministic variant of the Information Bottleneck algorithm.
We evaluate the performance of our method against four well-established clustering techniques.
arXiv Detail & Related papers (2024-07-03T09:06:19Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Using Decision Trees for Interpretable Supervised Clustering [0.0]
Supervised clustering aims at forming clusters of labelled data with high probability densities.
We are particularly interested in finding clusters of data of a given class and describing the clusters with a comprehensive set of rules.
arXiv Detail & Related papers (2023-07-16T17:12:45Z)
- Enhancing Cluster Quality of Numerical Datasets with Domain Ontology [2.790947019327459]
Ontology-based clustering can produce either high-quality or low-quality clusters from a dataset.
We present a clustering approach based on domain ontology to reduce the dimensionality of attributes in a numerical dataset.
The experimental results of our approach indicate that cluster quality gradually improves from the lower to the higher levels of a domain ontology.
arXiv Detail & Related papers (2023-04-02T23:40:17Z)
- Neural Capacitated Clustering [6.155158115218501]
We propose a new method for the Capacitated Clustering Problem (CCP) that learns a neural network to predict the assignment probabilities of points to cluster centers.
In our experiments on artificial data and two real-world datasets, our approach outperforms several state-of-the-art solvers from the literature.
arXiv Detail & Related papers (2023-02-10T09:33:44Z)
- Deep Clustering: A Comprehensive Survey [53.387957674512585]
Clustering analysis plays an indispensable role in machine learning and data mining.
Deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks.
Existing surveys for deep clustering mainly focus on the single-view fields and the network architectures, ignoring the complex application scenarios of clustering.
arXiv Detail & Related papers (2022-10-09T02:31:32Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy"
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Too Much Information Kills Information: A Clustering Perspective [6.375668163098171]
We propose a simple but novel approach for variance-based k-clustering tasks, including the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability.
arXiv Detail & Related papers (2020-09-16T01:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.