Improving clustering quality evaluation in noisy Gaussian mixtures
- URL: http://arxiv.org/abs/2503.00379v2
- Date: Thu, 27 Mar 2025 10:37:18 GMT
- Title: Improving clustering quality evaluation in noisy Gaussian mixtures
- Authors: Renato Cordeiro de Amorim, Vladimir Makarenkov
- Abstract summary: We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation. We demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features.
- Score: 2.3940819037450987
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
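For intuition, a minimal sketch of the idea behind FIR follows. The paper's exact rescaling formula is not reproduced here; the sketch assumes weights inversely proportional to each feature's within-cluster dispersion (computed on standardised features), and shows how down-weighting noise features changes a validity index such as the Average Silhouette Width. All names and parameters below are illustrative, not the authors' implementation.

```python
# Illustrative sketch only: dispersion-based feature rescaling before
# computing a validity index. The exact FIR formula from the paper is
# NOT reproduced here; inverse within-cluster dispersion is an assumed
# stand-in for the weighting scheme.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three Gaussian clusters in 4 informative dimensions, padded with
# 6 pure-noise features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X = np.hstack([X, rng.normal(size=(300, 6))])
X = StandardScaler().fit_transform(X)  # put all features on one scale

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Within-cluster dispersion per feature: sum over clusters of squared
# deviations from the cluster centroid. After standardisation, informative
# features concentrate around their centroids (low dispersion) while noise
# features stay spread out (high dispersion).
disp = np.zeros(X.shape[1])
for k in np.unique(labels):
    Xk = X[labels == k]
    disp += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)

# Assumed rescaling: weights inversely proportional to dispersion,
# normalised to sum to one.
w = 1.0 / disp
w /= w.sum()

print("silhouette, unweighted:", round(silhouette_score(X, labels), 3))
print("silhouette, rescaled:  ", round(silhouette_score(X * w, labels), 3))
```

Because the noise features retain high within-cluster dispersion after standardisation, they receive near-zero weight, so the rescaled silhouette reflects the compact, well-separated structure of the informative features.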
Related papers
- Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering [0.0]
In unsupervised learning, effective clustering plays a vital role in revealing patterns and insights from unlabeled data.
This paper explores feature weighting for clustering and presents new weighting strategies, including methods based on SHAP (SHapley Additive exPlanations).
Our empirical evaluations demonstrate that feature weighting based on SHAP can enhance unsupervised clustering quality, achieving up to a 22.69% improvement over other weighting methods (a schematic sketch of importance-based weighting appears after this list).
arXiv Detail & Related papers (2025-03-12T13:14:09Z)
- Does Unsupervised Domain Adaptation Improve the Robustness of Amortized Bayesian Inference? A Systematic Evaluation [3.4109073456116477]
Recent robust approaches employ unsupervised domain adaptation (UDA) to match the embedding spaces of simulated and observed data. We demonstrate that aligning summary spaces between domains effectively mitigates the impact of unmodeled phenomena or noise. Our results underscore the need for careful consideration of misspecification types when using UDA techniques to increase the robustness of ABI in practice.
arXiv Detail & Related papers (2025-02-07T14:13:51Z)
- Interaction-Aware Gaussian Weighting for Clustered Federated Learning [58.92159838586751]
Federated Learning (FL) has emerged as a decentralized paradigm to train models while preserving privacy. We propose a novel clustered FL method, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy.
arXiv Detail & Related papers (2025-02-05T16:33:36Z)
- How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence [23.019102917957152]
Benchmark datasets that overlap with pre-training corpora inflate performance metrics and undermine the reliability of model evaluations.
We propose the Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination via the divergence between kernel similarity matrices of sample embeddings.
KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines.
arXiv Detail & Related papers (2025-02-02T05:50:39Z)
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- From A-to-Z Review of Clustering Validation Indices [4.08908337437878]
We review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms.
We suggest a classification framework for examining the functionality of both internal and external clustering validation measures.
arXiv Detail & Related papers (2024-07-18T13:52:02Z)
- Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures [2.2252684361733284]
Deep clustering is a method for partitioning complex, high-dimensional data using deep neural networks.
Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering.
This paper addresses these challenges in evaluating clustering quality in deep learning.
arXiv Detail & Related papers (2024-03-21T20:43:44Z)
- Cluster Metric Sensitivity to Irrelevant Features [0.0]
We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways.
Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to added irrelevant features (a minimal replication sketch appears after this list).
arXiv Detail & Related papers (2024-02-19T10:02:00Z)
- Sanitized Clustering against Confounding Bias [38.928080236294775]
This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB).
SCAB removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure.
Experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance.
arXiv Detail & Related papers (2023-11-02T14:10:14Z)
- Fairness in Visual Clustering: A Novel Transformer Clustering Approach [32.806921406869996]
We first evaluate demographic bias in deep clustering models from the perspective of cluster purity.
A novel loss function is introduced to encourage purity consistency across all clusters, preserving the fairness aspect.
We present a novel attention mechanism, Cross-attention, to measure correlations between multiple clusters.
arXiv Detail & Related papers (2023-04-14T21:59:32Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, RIn-Close_CVC3, retains the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
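Below are the two illustrative sketches referenced in the list above. First, importance-based feature weighting in the spirit of "Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering": the paper uses SHAP values, but to keep this sketch dependency-free it swaps in scikit-learn's permutation importance as the importance measure; the pipeline and all names are ours, not the paper's.

```python
# Schematic sketch of importance-based feature weighting for clustering.
# The referenced paper uses SHAP; permutation importance is swapped in
# here as a dependency-free stand-in.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X, _ = make_blobs(n_samples=400, centers=4, n_features=3, random_state=1)
X = np.hstack([X, rng.normal(size=(400, 5))])  # 5 irrelevant features

# 1. Initial clustering; 2. surrogate classifier on the cluster labels;
# 3. feature importances -> weights; 4. re-cluster the weighted data.
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, labels)
imp = permutation_importance(clf, X, labels, n_repeats=10, random_state=1)
w = np.clip(imp.importances_mean, 0, None)
w /= w.sum()

relabels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X * w)
print("silhouette before weighting:", round(silhouette_score(X, labels), 3))
print("silhouette after weighting: ", round(silhouette_score(X * w, relabels), 3))
```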
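Second, a minimal replication sketch of the setup in "Cluster Metric Sensitivity to Irrelevant Features": append growing numbers of pure-noise features to well-separated Gaussian clusters and observe how internal validity indices respond. Exact numbers will differ from the paper's experiments.

```python
# Minimal sensitivity experiment: how do internal validity indices react
# as pure-noise features are added to well-separated Gaussian clusters?
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

rng = np.random.default_rng(2)
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=2)

for n_noise in (0, 2, 8, 32):
    Xn = np.hstack([X, rng.normal(size=(300, n_noise))]) if n_noise else X
    labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(Xn)
    print(f"{n_noise:2d} noise features | "
          f"ASW={silhouette_score(Xn, labels):.3f} | "
          f"CH={calinski_harabasz_score(Xn, labels):.1f} | "
          f"DB={davies_bouldin_score(Xn, labels):.3f}")
```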