Hierarchical Qualitative Clustering: clustering mixed datasets with
critical qualitative information
- URL: http://arxiv.org/abs/2006.16701v3
- Date: Mon, 6 Jul 2020 11:07:34 GMT
- Title: Hierarchical Qualitative Clustering: clustering mixed datasets with
critical qualitative information
- Authors: Diogo Seca, João Mendes-Moreira, Tiago Mendes-Neves, Ricardo Sousa
- Abstract summary: We propose Hierarchical Qualitative Clustering (HQC), a novel method for clustering qualitative values based on hierarchical clustering and Maximum Mean Discrepancy.
Using a mixed dataset provided by Spotify, we showcase how our method can be used for clustering music artists based on the quantitative features of thousands of songs.
In addition, using financial features of companies, we cluster company industries and discuss the implications for investment portfolios.
- Score: 0.2294014185517203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering can be used to extract insights from data or to verify some of the
assumptions held by the domain experts, namely data segmentation. In the
literature, few methods can be applied to clustering qualitative values using
the context associated with other variables present in the data, without losing
interpretability. Moreover, the metrics for calculating dissimilarity between
qualitative values often scale poorly for high dimensional mixed datasets.
In this study, we propose Hierarchical Qualitative Clustering (HQC), a novel
method for clustering qualitative values based on hierarchical clustering and
Maximum Mean Discrepancy (MMD). HQC
maintains the original interpretability of the qualitative information present
in the dataset. We apply HQC to two datasets. Using a mixed dataset provided by
Spotify, we showcase how our method can be used for clustering music artists
based on the quantitative features of thousands of songs. In addition, using
financial features of companies, we cluster company industries and discuss the
implications for investment portfolio diversification.
Related papers
- A Machine Learning-Based Framework for Clustering Residential
Electricity Load Profiles to Enhance Demand Response Programs [0.0]
We present a novel machine learning-based framework for achieving optimal load profiling through a real case study.
arXiv Detail & Related papers (2023-10-31T11:23:26Z) - Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z) - Using Decision Trees for Interpretable Supervised Clustering [0.0]
Supervised clustering aims to form clusters of labelled data with high probability densities.
We are particularly interested in finding clusters of data of a given class and describing the clusters with a set of comprehensible rules.
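One plausible concrete reading of this idea (a sketch, not the paper's actual algorithm) is to fit a shallow decision tree on the labelled data and treat each leaf that is near-pure in the class of interest as a cluster; the root-to-leaf path is then the rule describing it. The purity threshold and tree depth below are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Two dense blobs of class 1 plus a sparse class-0 background.
X = np.vstack([
    rng.normal([0, 0], 0.3, (40, 2)),
    rng.normal([4, 4], 0.3, (40, 2)),
    rng.uniform(-2, 6, (20, 2)),
])
y = np.array([1] * 80 + [0] * 20)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)                  # leaf index for every point
clusters = {}
for leaf in np.unique(leaf_ids):
    members = np.where(leaf_ids == leaf)[0]
    if y[members].mean() > 0.8:           # keep near-pure class-1 leaves
        clusters[int(leaf)] = members

rules = export_text(tree, feature_names=["x1", "x2"])
print(rules)                              # the interpretable rule set
```

Each retained leaf is a high-density region of the target class, and `export_text` prints the comprehensible rules directly.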
arXiv Detail & Related papers (2023-07-16T17:12:45Z) - Enhancing Cluster Quality of Numerical Datasets with Domain Ontology [2.790947019327459]
Ontology-based clustering can produce either high-quality or low-quality clusters from a dataset.
We present a clustering approach that is based on domain ontology to reduce the dimensionality of attributes in a numerical dataset.
The experimental results of our approach indicate that cluster quality gradually improves from the lower to the higher levels of a domain ontology.
arXiv Detail & Related papers (2023-04-02T23:40:17Z) - Neural Capacitated Clustering [6.155158115218501]
We propose a new method for the Capacitated Clustering Problem (CCP) that learns a neural network to predict the assignment probabilities of points to cluster centers.
In our experiments on artificial data and two real-world datasets, our approach outperforms several state-of-the-art solvers from the literature.
arXiv Detail & Related papers (2023-02-10T09:33:44Z) - Deep Clustering: A Comprehensive Survey [53.387957674512585]
Clustering analysis plays an indispensable role in machine learning and data mining.
Deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks.
Existing surveys for deep clustering mainly focus on the single-view fields and the network architectures, ignoring the complex application scenarios of clustering.
arXiv Detail & Related papers (2022-10-09T02:31:32Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We give implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
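The "privately combine" step can be illustrated with a generic Laplace-noise release of cluster means; this is a toy stand-in under loose sensitivity bounds, not the paper's actual mechanism for easy instances, and the non-private clustering is mocked by a simple threshold.

```python
import numpy as np

def private_centers(X, labels, k, epsilon, bound=1.0):
    """Release noisy cluster means. Points are assumed clipped to [-bound, bound]^d.

    The privacy budget is split evenly between the per-cluster sum and count;
    the sensitivity bounds used here are deliberately loose and illustrative.
    """
    d = X.shape[1]
    rng = np.random.default_rng(0)
    centers = []
    for c in range(k):
        pts = X[labels == c]
        # Changing one point shifts the coordinate-wise sum by at most 2*bound
        # per dimension, and the count by 1.
        noisy_sum = pts.sum(axis=0) + rng.laplace(0, 2 * bound * d / (epsilon / 2), d)
        noisy_count = len(pts) + rng.laplace(0, 1 / (epsilon / 2))
        centers.append(noisy_sum / max(noisy_count, 1.0))
    return np.array(centers)

# Two well-separated blobs; a trivial threshold stands in for the
# non-private clustering algorithm applied to an "easy" instance.
rng = np.random.default_rng(1)
X = np.clip(np.vstack([rng.normal(-0.5, 0.05, (100, 2)),
                       rng.normal(0.5, 0.05, (100, 2))]), -1, 1)
labels = (X[:, 0] > 0).astype(int)
centers = private_centers(X, labels, k=2, epsilon=100.0)
```

With a generous epsilon the released centers land close to the true cluster means; shrinking epsilon trades accuracy for privacy.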
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Too Much Information Kills Information: A Clustering Perspective [6.375668163098171]
We propose a simple but novel approach for variance-based k-clustering tasks, including the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability.
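The subset-only decision idea can be sketched as follows: run Lloyd's k-means on a small uniform sample, then assign every point in the full dataset to its nearest sampled center. The paper's sampling scheme and guarantees are more refined than this uniform-sampling toy; the farthest-point initialisation is an assumption added to keep the sketch deterministic.

```python
import numpy as np

def kmeans(X, k, iters=25):
    """Lloyd's algorithm with farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return centers

def subset_kmeans(X, k, sample_size, seed=0):
    """All clustering decisions use only a uniform sample of the data."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), sample_size, replace=False)]
    centers = kmeans(sample, k)
    # The full dataset is only touched for the final nearest-center assignment.
    labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    return labels, centers

# 400 points, but k-means only ever sees 20 of them.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-5, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))])
labels, centers = subset_kmeans(X, k=2, sample_size=20)
```

On well-separated data the sample already pins down the cluster structure, which is the intuition behind the high-probability guarantee quoted above.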
arXiv Detail & Related papers (2020-09-16T01:54:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.