Adversarial Subspace Generation for Outlier Detection in High-Dimensional Data
- URL: http://arxiv.org/abs/2504.07522v1
- Date: Thu, 10 Apr 2025 07:40:02 GMT
- Title: Adversarial Subspace Generation for Outlier Detection in High-Dimensional Data
- Authors: Jose Cribeiro-Ramallo, Federico Matteucci, Paul Enciu, Alexander Jenke, Vadim Arzamasov, Thorsten Strufe, Klemens Böhm,
- Abstract summary: We introduce Myopic Subspace Theory (MST), a new theoretical framework that mathematically formulates the Multiple Views effect.<n>Based on MST, we introduce V-GAN, a generative method trained to solve such an optimization problem.<n> Experiments on 42 real-world datasets show that using V-GAN leads to a significant increase in one-class classification performance.
- Score: 41.09135146101542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outlier detection in high-dimensional tabular data is challenging since data is often distributed across multiple lower-dimensional subspaces -- a phenomenon known as the Multiple Views effect (MV). This effect led to a large body of research focused on mining such subspaces, known as subspace selection. However, as the precise nature of the MV effect was not well understood, traditional methods had to rely on heuristic-driven search schemes that struggle to accurately capture the true structure of the data. Properly identifying these subspaces is critical for unsupervised tasks such as outlier detection or clustering, where misrepresenting the underlying data structure can hinder the performance. We introduce Myopic Subspace Theory (MST), a new theoretical framework that mathematically formulates the Multiple Views effect and writes subspace selection as a stochastic optimization problem. Based on MST, we introduce V-GAN, a generative method trained to solve such an optimization problem. This approach avoids any exhaustive search over the feature space while ensuring that the intrinsic data structure is preserved. Experiments on 42 real-world datasets show that using V-GAN subspaces to build ensemble methods leads to a significant increase in one-class classification performance -- compared to existing subspace selection, feature selection, and embedding methods. Further experiments on synthetic data show that V-GAN identifies subspaces more accurately while scaling better than other relevant subspace selection methods. These results confirm the theoretical guarantees of our approach and also highlight its practical viability in high-dimensional settings.
Related papers
- How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces? [10.315743300140966]
Real-world data is often assumed to lie within a low-dimensional structure embedded in high-dimensional space.
In practical settings, we observe only a finite set of samples, forming what we refer to as the sample data subspace.
A major challenge lies in whether generative models can reliably synthesize samples that stay within this subspace.
arXiv Detail & Related papers (2024-10-31T03:08:07Z) - Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z) - Learning Structure Aware Deep Spectral Embedding [11.509692423756448]
We propose a novel structure-aware deep spectral embedding by combining a spectral embedding loss and a structure preservation loss.
A deep neural network architecture is proposed that simultaneously encodes both types of information and aims to generate structure-aware spectral embedding.
The proposed algorithm is evaluated on six publicly available real-world datasets.
arXiv Detail & Related papers (2023-05-14T18:18:05Z) - Linking data separation, visual separation, and classifier performance
using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z) - Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutive pressure acts on a low-dimensional manifold despite the high-dimensionality of sequences' space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z) - Tensor Laplacian Regularized Low-Rank Representation for Non-uniformly
Distributed Data Subspace Clustering [2.578242050187029]
Low-Rank Representation (LRR) suffers from discarding the locality information of data points in subspace clustering.
We propose a hypergraph model to facilitate having a variable number of adjacent nodes and incorporating the locality information of the data.
Experiments on artificial and real datasets demonstrate the higher accuracy and precision of the proposed method.
arXiv Detail & Related papers (2021-03-06T08:22:24Z) - A Critique of Self-Expressive Deep Subspace Clustering [23.971512395191308]
Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces.
We show that there are a number of potential flaws with this approach which have not been adequately addressed in prior work.
arXiv Detail & Related papers (2020-10-08T00:14:59Z) - Joint and Progressive Subspace Analysis (JPSA) with Spatial-Spectral
Manifold Alignment for Semi-Supervised Hyperspectral Dimensionality Reduction [48.73525876467408]
We propose a novel technique for hyperspectral subspace analysis.
The technique is called joint and progressive subspace analysis (JPSA)
Experiments are conducted to demonstrate the superiority and effectiveness of the proposed JPSA on two widely-used hyperspectral datasets.
arXiv Detail & Related papers (2020-09-21T16:29:59Z) - Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.