Outlier Detection and Data Clustering via Innovation Search
- URL: http://arxiv.org/abs/1912.12988v1
- Date: Mon, 30 Dec 2019 16:29:04 GMT
- Title: Outlier Detection and Data Clustering via Innovation Search
- Authors: Mostafa Rahmani and Ping Li
- Abstract summary: We present a new discovery that the directions of innovation can be used to design a robust PCA method.
The proposed approach, dubbed iSearch, uses the direction search optimization problem to compute an optimal direction corresponding to each data point.
- Score: 27.107601048639637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The idea of Innovation Search was proposed as a data clustering
method in which the directions of innovation were utilized to compute the
adjacency matrix, and it was shown that Innovation Pursuit can notably
outperform the self-representation-based subspace clustering methods. In this
paper, we present a new discovery: the directions of innovation can also be
used to design a provably and strongly robust (to outliers) PCA method. The
proposed approach, dubbed iSearch, uses the direction-search optimization
problem to compute an optimal direction corresponding to each data point.
iSearch utilizes the directions of innovation to measure the innovation of
the data points, and it identifies the outliers as the most innovative data
points. Analytical performance guarantees are derived for the proposed robust
PCA method under different models for the distribution of the outliers,
including randomly distributed outliers, clustered outliers, and linearly
dependent outliers. In addition, we study the problem of outlier detection in
a union of subspaces and show that iSearch provably recovers the span of the
inliers when the inliers lie in a union of subspaces. Moreover, we present
theoretical studies showing that the proposed measure of innovation remains
stable in the presence of noise and that the performance of iSearch is robust
to noisy data. In the challenging scenarios in which the outliers are close
to each other or close to the span of the inliers, iSearch remarkably
outperforms most of the existing methods. The presented method shows that the
directions of innovation are a useful representation of the data that can be
used to perform both data clustering and outlier detection.
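The outlier-detection idea in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it replaces the l1 direction-search objective with a quadratic surrogate, min_c ||X^T c||^2 subject to c^T x_i = 1, whose optimal value has a closed form, and then flags the columns with the largest resulting innovation values. All data sizes and the synthetic setup are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 90 inliers on a 3-dimensional subspace of R^20,
# plus 10 randomly distributed outliers, stored as columns of X.
d, r, n_in, n_out = 20, 3, 90, 10
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]
inliers = basis @ rng.standard_normal((r, n_in))
outliers = rng.standard_normal((d, n_out))
X = np.hstack([inliers, outliers])
X = X / np.linalg.norm(X, axis=0)          # normalize the columns

# Quadratic-cost surrogate for the direction search: the optimal value of
# min_c ||X^T c||^2 s.t. c^T x_i = 1 equals 1 / (x_i^T (X X^T)^+ x_i),
# so the innovation value of column i reduces to its leverage score,
# i.e. the squared norm of row i of V in the (rank-truncated) SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r_eff = int(np.sum(s > 1e-8 * s[0]))       # numerical rank
leverage = np.sum(Vt[:r_eff] ** 2, axis=0)

# Flag the most innovative columns as outliers
# (indices 90..99 in this synthetic setup).
flagged = np.argsort(leverage)[-n_out:]
print(sorted(flagged.tolist()))
```

Because the outliers span directions the inliers do not, their innovation values here are close to 1, while the 90 inliers share a 3-dimensional subspace and score near 3/90 each, so the two groups separate cleanly.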
Related papers
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance.
Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws.
We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- Study Features via Exploring Distribution Structure [9.596923373834093]
We present a novel framework for data redundancy measurement based on probabilistic modeling of datasets, and a new criterion for redundancy detection that is resilient to noise.
Our framework is flexible and can handle different types of features, and our experiments on benchmark datasets demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2024-01-15T09:01:31Z)
- Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation [110.34982764201689]
Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications.
Recent advances in outlier exposure have shown promising results on OOD detection by fine-tuning models with informatively sampled auxiliary outliers.
We propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers.
arXiv Detail & Related papers (2023-10-21T07:16:09Z)
- STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning [111.75423966239092]
We propose an exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal.
Based on the kernelized Stein discrepancy (KSD), we develop a novel algorithm, STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING.
arXiv Detail & Related papers (2023-01-28T00:49:28Z)
- Hub-VAE: Unsupervised Hub-based Regularization of Variational Autoencoders [11.252245456934348]
We propose an unsupervised, data-driven regularization of the latent space with a mixture of hub-based priors and a hub-based contrastive loss.
Our algorithm achieves superior cluster separability in the embedding space, and accurate data reconstruction and generation.
arXiv Detail & Related papers (2022-11-18T19:12:15Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Score matching enables causal discovery of nonlinear additive noise models [63.93669924730725]
We show how to design a new generation of scalable causal discovery methods.
We propose a new efficient method for approximating the score's Jacobian, enabling recovery of the causal graph.
arXiv Detail & Related papers (2022-03-08T21:34:46Z)
- Better Modelling Out-of-Distribution Regression on Distributed Acoustic Sensor Data Using Anchored Hidden State Mixup [0.7455546102930911]
Generalizing the application of machine learning models to situations where the statistical distribution of training and test data are different has been a complex problem.
We introduce an anchor-based Out-of-Distribution (OOD) Regression Mixup algorithm, leveraging manifold hidden state mixup and observation similarities to form a novel regularization penalty.
We demonstrate with an extensive evaluation the generalization performance of the proposed method against existing approaches, then show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-02-23T03:12:21Z)
- Learning Bias-Invariant Representation by Cross-Sample Mutual Information Minimization [77.8735802150511]
We propose a cross-sample adversarial debiasing (CSAD) method to remove the bias information misused by the target task.
The correlation measurement plays a critical role in adversarial debiasing and is conducted by a cross-sample neural mutual information estimator.
We conduct thorough experiments on publicly available datasets to validate the advantages of the proposed method over state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T21:17:02Z)
- Closed-Form, Provable, and Robust PCA via Leverage Statistics and Innovation Search [25.229137979402584]
We study the Innovation Values computed by the Innovation Search algorithm under a quadratic cost function.
It is proved that Innovation Values with the new cost function are equivalent to Leverage Scores.
This interesting connection is utilized to establish several theoretical guarantees for a Leverage Score based robust PCA method.
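The stated equivalence can be checked numerically. Under a quadratic cost, the optimal value of the direction-search problem min_c ||X^T c||^2 subject to c^T x_i = 1 works out, via the Lagrangian, to the reciprocal of x_i's leverage score x_i^T (X X^T)^+ x_i. The snippet below is a small verification on random data, not the paper's code; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 30))         # 30 data points as columns in R^8

# Leverage score of column i: x_i^T (X X^T)^+ x_i (diagonal of the hat matrix).
G_pinv = np.linalg.pinv(X @ X.T)
hat_diag = np.einsum('ij,jk,ki->i', X.T, G_pinv, X)

# For each column, the closed-form optimal direction is
# c* = (X X^T)^+ x_i / (x_i^T (X X^T)^+ x_i), and its cost ||X^T c*||^2
# equals 1 / leverage score.
for i in range(X.shape[1]):
    x = X[:, i]
    c = G_pinv @ x / (x @ G_pinv @ x)    # optimal direction for column i
    assert np.isclose(np.sum((X.T @ c) ** 2), 1.0 / hat_diag[i])
print("quadratic-cost innovation value == 1 / leverage score for every column")
```

This is why a large Innovation Value (small achievable direction-search cost) coincides with a large leverage score, and why the most innovative columns are exactly the high-leverage ones.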
arXiv Detail & Related papers (2021-06-23T06:36:36Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
The proposed algorithms offer robustness with little computational overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Robust Locality-Aware Regression for Labeled Data Classification [5.432221650286726]
We propose a new discriminant feature extraction framework, namely Robust Locality-Aware Regression (RLAR).
In our model, we introduce a retargeted regression to perform the marginal representation learning adaptively instead of using the general average inter-class margin.
To alleviate the disturbance of outliers and prevent overfitting, we measure the regression term and the locality-aware term, together with the regularization term, by the L2,1 norm.
arXiv Detail & Related papers (2020-06-15T11:36:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.