Related papers: Recovering Imbalanced Clusters via Gradient-Based Projection Pursuit

URL: http://arxiv.org/abs/2502.02668v1
Date: Tue, 04 Feb 2025 19:18:17 GMT
Title: Recovering Imbalanced Clusters via Gradient-Based Projection Pursuit
Authors: Martin Eppert, Satyaki Mukherjee, Debarghya Ghoshdastidar
Abstract summary: We propose a method for recovering projections containing either Imbalanced Clusters or a Bernoulli-Rademacher distribution. We analyze our algorithm's sample complexity within a Planted Vector setting where we can observe that Imbalanced Clusters can be recovered more easily than balanced ones. We experimentally evaluate our method's applicability to real-world data using FashionMNIST and the Human Activity Recognition dataset.
Score: 7.141484637056533
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Projection Pursuit is a classic exploratory technique for finding interesting projections of a dataset. We propose a method for recovering projections containing either Imbalanced Clusters or a Bernoulli-Rademacher distribution using a gradient-based technique to optimize the projection index. As sample complexity is a major limiting factor in Projection Pursuit, we analyze our algorithm's sample complexity within a Planted Vector setting where we can observe that Imbalanced Clusters can be recovered more easily than balanced ones. Additionally, we give a generalized result that works for a variety of data distributions and projection indices. We compare these results to computational lower bounds in the Low-Degree-Polynomial Framework. Finally, we experimentally evaluate our method's applicability to real-world data using FashionMNIST and the Human Activity Recognition Dataset, where our algorithm outperforms others when only a few samples are available.
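To make the idea in the abstract concrete, the following is a minimal sketch of gradient-based projection pursuit in a planted vector setting. It is not the authors' algorithm: the third-moment projection index, the step size, the number of restarts, and the function names `make_planted_data`, `third_moment_index`, and `projection_pursuit` are all assumptions chosen for illustration. The data are isotropic Gaussian except along one hidden direction, whose projection follows an imbalanced two-point distribution.

```python
import numpy as np


def make_planted_data(n, d, p, seed=0):
    """Isotropic data with one hidden direction carrying an imbalanced two-point law."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    # Two-point variable with P(a) = p and P(b) = 1 - p, scaled to mean 0, variance 1.
    a, b = np.sqrt((1 - p) / p), -np.sqrt(p / (1 - p))
    z[:, 0] = np.where(rng.random(n) < p, a, b)
    # Hide the structured coordinate behind a random rotation q.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return z @ q.T, q[:, 0]  # data matrix and planted direction


def third_moment_index(X, w):
    """Assumed projection index: empirical third moment of the projection X @ w."""
    return np.mean((X @ w) ** 3)


def projection_pursuit(X, restarts=20, steps=100, lr=0.1, seed=0):
    """Projected gradient ascent on the unit sphere, keeping the best restart."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_w, best_val = None, -np.inf
    for _ in range(restarts):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        for _ in range(steps):
            # Gradient of mean((X w)^3) with respect to w.
            grad = 3.0 * (X.T @ (X @ w) ** 2) / n
            # Remove the radial component so the step stays tangent to the sphere.
            grad -= (grad @ w) * w
            w = w + lr * grad
            w /= np.linalg.norm(w)
        val = third_moment_index(X, w)
        if val > best_val:
            best_w, best_val = w, val
    return best_w


if __name__ == "__main__":
    X, u = make_planted_data(n=20000, d=10, p=0.1, seed=0)
    w = projection_pursuit(X, seed=1)
    print("alignment with planted direction:", abs(w @ u))
```

The third moment is used here because an imbalanced two-point distribution is skewed while a Gaussian is not. A balanced two-point distribution has zero skewness, so this particular index would not detect it and a different index would be needed.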

Related papers

Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning [3.4530027457862005]
We introduce a novel clustering approach based on meta-learning. We employ a pre-trained Prior-Data Fitted Transformer Network (PFN) to perform clustering. We show that our approach is superior to the state-of-the-art clustering techniques.
arXiv Detail & Related papers (2025-07-27T17:53:19Z)
ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs [22.68421108740517]
We propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods.
arXiv Detail & Related papers (2025-06-12T01:53:01Z)
Nearly Optimal Sample Complexity for Learning with Label Proportions [54.67830198790247]
We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal.
arXiv Detail & Related papers (2025-05-08T15:45:23Z)
Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers. We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art. In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction. We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue. In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
Efficient Failure Pattern Identification of Predictive Algorithms [15.02620042972929]
We propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.
arXiv Detail & Related papers (2023-06-01T14:54:42Z)
Learning Rate Free Sampling in Constrained Domains [21.853333421463603]
We introduce a suite of new particle-based algorithms for sampling in constrained domains which are entirely learning rate free. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex.
arXiv Detail & Related papers (2023-05-24T09:31:18Z)
Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning [16.673022545571566]
We propose a new method for high-dimensional semi-supervised learning problems. It is based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data.
arXiv Detail & Related papers (2023-04-18T17:49:02Z)
Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class. Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class. We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment. Policy gradients for local search are often obtained from random perturbations. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
Discrete-Valued Latent Preference Matrix Estimation with Graph Side Information [12.836994708337144]
We develop an algorithm that matches the optimal sample complexity. Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance.
arXiv Detail & Related papers (2020-03-16T06:29:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.