Why Can't I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction Validation
- URL: http://arxiv.org/abs/2509.04222v1
- Date: Thu, 04 Sep 2025 13:53:16 GMT
- Title: Why Can't I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction Validation
- Authors: Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich
- Abstract summary: Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly.
- Score: 46.5272770104348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. However, such a structure may not always appear in the projections. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. Visual Analytics solutions can help, but are often time-consuming due to the large hyperparameter space. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly. We introduce two supervised metrics, precision and recall, to evaluate the relationship phase. These metrics quantify how well the modeled relationships align with an expected cluster structure based on some set of labels representing this structure. We illustrate their application using t-SNE and UMAP, and validate the approach through various usage scenarios. Our approach can guide hyperparameter tuning, uncover projection artifacts, and determine if the expected structure is captured in the relationships, making the DR process faster and more reliable.
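The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' definition. It assumes the modeled relationships can be approximated as undirected k-nearest-neighbor pairs (a simplification of what t-SNE and UMAP actually build), that precision is the share of modeled pairs joining points with the same label, and that recall is the share of same-label pairs that were modeled. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist


def knn_relationships(points, k):
    """Return undirected pairs (i, j) where j is one of i's k nearest neighbors.

    Assumption: a plain Euclidean kNN graph stands in for the relationship phase.
    """
    pairs = set()
    for i, p in enumerate(points):
        # Rank every other point by distance to p and keep the k closest.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def relationship_precision_recall(pairs, labels):
    """Compare modeled pairs with the same-label pairs implied by the labels."""
    expected = {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }
    hits = len(pairs & expected)
    precision = hits / len(pairs) if pairs else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

for k in (1, 2, 4):
    p, r = relationship_precision_recall(knn_relationships(points, k), labels)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Under these assumptions, a neighborhood that is too small leaves same-label pairs unmodeled (lower recall), while one that is too large pulls in pairs from other groups (lower precision). This is the kind of signal the abstract suggests can guide hyperparameter tuning before any projection is computed.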
Related papers
- Twinning Complex Networked Systems: Data-Driven Calibration of the mABCD Synthetic Graph Generator [2.6776012440607784]
We propose a method for estimating matching configurations and for quantifying the associated error. Our results demonstrate that this task is non-trivial, as strong interdependencies between configuration parameters weaken independent estimation and instead favour a joint-prediction approach.
arXiv Detail & Related papers (2026-02-02T12:40:19Z)
- A Survey of Dimension Estimation Methods [0.0]
It is important to understand the real dimension of the data, hence the complexity of the dataset at hand. This survey reviews a wide range of dimension estimation methods, categorising them by the geometric information they exploit. The paper evaluates the performance of these methods, as well as investigating varying responses to curvature and noise.
arXiv Detail & Related papers (2025-07-18T13:05:42Z)
- Measuring the Predictability of Recommender Systems using Structural Complexity Metrics [0.6429591199690016]
This study introduces data-driven metrics to measure the predictability of RS based on the structural complexity of the user-item rating matrix.
A low predictability score indicates complex and unpredictable user-item interactions, while a high predictability score reveals less complex patterns with predictive potential.
arXiv Detail & Related papers (2024-04-12T22:00:27Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem. This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- PromptORE -- A Novel Approach Towards Fully Unsupervised Relation Extraction [0.0]
Unsupervised Relation Extraction (RE) aims to identify relations between entities in text, without having access to labeled data during training.
We propose PromptORE, a "Prompt-based Open Relation Extraction" model.
We adapt the novel prompt-tuning paradigm to work in an unsupervised setting, and use it to embed sentences expressing a relation.
We show that PromptORE consistently outperforms state-of-the-art models with a relative gain of more than 40% in B³, V-measure and ARI.
arXiv Detail & Related papers (2023-03-24T12:55:35Z)
- Representation Disentaglement via Regularization by Causal Identification [3.9160947065896803]
We propose the use of a causal collider structured model to describe the underlying data generative process assumptions in disentangled representation learning.
For this, we propose regularization by identification (ReI), a modular regularization engine designed to align the behavior of large scale generative models with the disentanglement constraints imposed by causal identification.
arXiv Detail & Related papers (2023-02-28T23:18:54Z)
- Design of Compressed Sensing Systems via Density-Evolution Framework for Structure Recovery in Graphical Models [10.667885727418705]
It has been shown that learning the structure of Bayesian networks from observational data is an NP-Hard problem.
We propose a novel density-evolution based framework for optimizing compressed linear measurement systems.
We show that the structure of GBN can indeed be recovered from resulting compressed measurements.
arXiv Detail & Related papers (2022-03-17T22:16:38Z)
- Structural Causal Models Are (Solvable by) Credal Networks [70.45873402967297]
Causal inferences can be obtained by standard algorithms for the updating of credal nets.
This contribution should be regarded as a systematic approach to represent structural causal models by credal networks.
Experiments show that approximate algorithms for credal networks can immediately be used to do causal inference in real-size problems.
arXiv Detail & Related papers (2020-08-02T11:19:36Z)
- Supporting Optimal Phase Space Reconstructions Using Neural Network Architecture for Time Series Modeling [68.8204255655161]
We propose an artificial neural network with a mechanism to implicitly learn the properties of phase spaces.
Our approach is either as competitive as or better than most state-of-the-art strategies.
arXiv Detail & Related papers (2020-06-19T21:04:47Z)
- Transformer Hawkes Process [79.16290557505211]
We propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies.
THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin.
We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.
arXiv Detail & Related papers (2020-02-21T13:48:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.