Revisiting data augmentation for subspace clustering
- URL: http://arxiv.org/abs/2207.09728v1
- Date: Wed, 20 Jul 2022 08:13:08 GMT
- Title: Revisiting data augmentation for subspace clustering
- Authors: Maryam Abdolali, Nicolas Gillis
- Abstract summary: Subspace clustering is the classical problem of clustering a collection of data samples around several low-dimensional subspaces.
We argue that data distribution within each subspace plays a critical role in the success of self-expressive models.
We propose two subspace clustering frameworks for both unsupervised and semi-supervised settings.
- Score: 21.737226432466496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subspace clustering is the classical problem of clustering a collection of
data samples that approximately lie around several low-dimensional subspaces.
The current state-of-the-art approaches for this problem are based on the
self-expressive model, which represents the samples as a linear combination of
other samples. However, these approaches require sufficiently well-spread
samples for accurate representation, which might not necessarily be accessible
in many applications. In this paper, we shed light on this commonly neglected
issue and argue that data distribution within each subspace plays a critical
role in the success of self-expressive models. Our proposed solution to tackle
this issue is motivated by the central role of data augmentation in the
generalization power of deep neural networks. We propose two subspace
clustering frameworks for both unsupervised and semi-supervised settings that
use augmented samples as an enlarged dictionary to improve the quality of the
self-expressive representation. We present an automatic augmentation strategy
using a few labeled samples for the semi-supervised problem relying on the fact
that the data samples lie in the union of multiple linear subspaces.
Experimental results confirm the effectiveness of data augmentation, as it
significantly improves the performance of general self-expressive models.
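The idea of using augmented samples as an enlarged dictionary can be sketched as follows. This is only an illustrative sketch, not the authors' method: the ridge (least-squares) solver, the function names `self_expressive_coeffs` and `ridge_objective`, the parameter `lam`, and the noise-based augmentation in the demo are all assumptions made here for illustration.

```python
import numpy as np

def self_expressive_coeffs(X, A=None, lam=1e-2):
    """Ridge self-expressive coefficients (illustrative assumption, not the paper's solver).

    X: (d, n) array, samples as columns.
    A: optional (d, m) array of augmented samples appended to the dictionary.
    Returns C of shape (n + m, n); column j expresses X[:, j] with the
    dictionary [X, A], excluding sample j itself.
    """
    n = X.shape[1]
    D = X if A is None else np.hstack([X, A])
    k = D.shape[1]
    C = np.zeros((k, n))
    for j in range(n):
        idx = [i for i in range(k) if i != j]  # a sample may not represent itself
        Dj = D[:, idx]
        G = Dj.T @ Dj + lam * np.eye(k - 1)
        C[idx, j] = np.linalg.solve(G, Dj.T @ X[:, j])
    return C

def ridge_objective(X, D, C, lam=1e-2):
    """Reconstruction error plus ridge penalty, summed over all samples."""
    return np.sum((X - D @ C) ** 2) + lam * np.sum(C ** 2)

# Toy data: 8 samples on a 2-dimensional subspace of R^5.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 8))
# Hypothetical augmentation: small Gaussian perturbations of the samples.
A = X + 0.01 * rng.standard_normal(X.shape)

C_plain = self_expressive_coeffs(X)
C_aug = self_expressive_coeffs(X, A)
print(ridge_objective(X, X, C_plain), ridge_objective(X, np.hstack([X, A]), C_aug))
```

Because the original dictionary is a subset of the enlarged one, the regularized objective with augmentation can never be worse than without it; whether the resulting clustering improves depends on how the augmented samples are generated.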
Related papers
- Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets.
We show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically.
arXiv Detail & Related papers (2024-11-19T14:13:25Z)
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
- Deep Generative Sampling in the Dual Divergence Space: A Data-efficient & Interpretative Approach for Generative AI [29.13807697733638]
We build on the remarkable achievements in generative sampling of natural images.
We propose an innovative challenge, potentially overly ambitious, which involves generating samples that resemble images.
The statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects.
arXiv Detail & Related papers (2024-04-10T22:35:06Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that the gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Unsupervised Manifold Linearizing and Clustering [19.879641608165887]
We propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership.
Experiments on CIFAR-10, -20, -100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods.
arXiv Detail & Related papers (2023-01-04T20:08:23Z)
- Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z)
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
- Stochastic Sparse Subspace Clustering [20.30051592270384]
State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points.
We introduce dropout, based on randomly dropping out data points, to address the issue of over-segmentation.
This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering.
arXiv Detail & Related papers (2020-05-04T13:09:17Z)
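As a rough illustration of the dropout idea mentioned in the last entry, the sketch below repeatedly drops random dictionary points, solves a small self-expressive problem on the survivors, and averages the coefficients. It is an assumption-laden toy, not the published algorithm: the ridge solver is swapped in for a sparse solver, and the function name `dropout_self_expression` and the parameters `drop_prob`, `n_draws`, `lam`, and `seed` are hypothetical.

```python
import numpy as np

def dropout_self_expression(X, j, drop_prob=0.5, n_draws=20, lam=1e-2, seed=0):
    """Average ridge coefficients for sample j over random sub-dictionaries.

    X: (d, n) array, samples as columns. Returns a length-n vector whose
    j-th entry is always zero (a sample never represents itself).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    others = np.array([i for i in range(n) if i != j])
    c_avg = np.zeros(n)
    for _ in range(n_draws):
        # Keep each other point independently with probability 1 - drop_prob.
        keep = others[rng.random(others.size) >= drop_prob]
        if keep.size == 0:
            continue  # every point was dropped in this draw; it contributes zeros
        D = X[:, keep]
        G = D.T @ D + lam * np.eye(keep.size)
        c_avg[keep] += np.linalg.solve(G, D.T @ X[:, j])
    return c_avg / n_draws

# Toy usage: 10 samples on a 2-dimensional subspace of R^6.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 10))
c = dropout_self_expression(X, j=3)
print(np.round(c, 3))
```

Averaging over many random sub-dictionaries spreads the coefficients across more points of the same subspace, which is the intuition behind using dropout against over-segmentation.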
This list is automatically generated from the titles and abstracts of the papers in this site.