Robust learning of data anomalies with analytically-solvable entropic
outlier sparsification
- URL: http://arxiv.org/abs/2112.11768v1
- Date: Wed, 22 Dec 2021 10:13:29 GMT
- Title: Robust learning of data anomalies with analytically-solvable entropic
outlier sparsification
- Authors: Illia Horenko
- Abstract summary: Entropic Outlier Sparsification (EOS) is proposed as a robust computational strategy for the detection of data anomalies.
The performance of EOS is compared to a range of commonly-used tools on synthetic problems and on partially-mislabeled supervised classification problems from biomedicine.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Entropic Outlier Sparsification (EOS) is proposed as a robust
computational strategy for the detection of data anomalies in a broad class of
learning methods, including unsupervised problems (such as the detection of
non-Gaussian outliers in mostly-Gaussian data) and supervised learning with
mislabeled data. EOS builds on a derived analytic closed-form solution of the
(weighted) expected-error minimization problem subject to Shannon entropy
regularization. In contrast to common regularization strategies, which require
computational costs that scale polynomially with the data dimension, the
identified closed-form solution is proven to impose additional iteration costs
that depend linearly on the statistics size and are independent of the data
dimension. The obtained analytic results also explain why mixtures of
spherically-symmetric Gaussians - used heuristically in many popular data
analysis algorithms - represent an optimal choice for the non-parametric
probability distributions when working with squared Euclidean distances,
combining expected-error minimality, maximal entropy/unbiasedness, and linear
cost scaling. The performance of EOS is compared to a range of commonly-used
tools on synthetic problems and on partially-mislabeled supervised
classification problems from biomedicine.
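
Written out, the closed form at the core of EOS is the entropic (Gibbs/softmax)
reweighting: minimizing the weighted expected error over the probability
simplex with a Shannon entropy regularizer,

$$ \min_{w \in \Delta_T} \; \sum_{t=1}^{T} w_t\,\varepsilon_t \;+\; \alpha \sum_{t=1}^{T} w_t \ln w_t, \qquad \alpha > 0, $$

yields $w_t \propto \exp(-\varepsilon_t/\alpha)$, so high-error (anomalous)
instances are downweighted exponentially at a cost linear in the statistics
size $T$. The Python sketch below illustrates this standard closed form; the
variable names and toy data are ours, not the author's reference
implementation.

```python
import numpy as np

def entropic_weights(errors, alpha):
    """Closed-form minimizer of sum_t w[t]*errors[t] + alpha*sum_t w[t]*log(w[t])
    over the probability simplex: a Gibbs/softmax distribution. The cost is
    linear in the number of instances and independent of the data dimension."""
    z = -np.asarray(errors, dtype=float) / alpha
    z -= z.max()                  # numerical stabilization before exponentiation
    w = np.exp(z)
    return w / w.sum()

# Toy example: one gross outlier among mostly-Gaussian squared-Euclidean errors.
rng = np.random.default_rng(0)
errors = rng.chisquare(df=3, size=100)
errors[17] = 50.0                 # anomalous instance with a large error
w = entropic_weights(errors, alpha=1.0)
print(w[17], 1.0 / len(errors))   # outlier weight is far below the uniform 0.01
```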
Related papers
- Assumption-Lean Post-Integrated Inference with Negative Control Outcomes [0.0]
We introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes.
Our assumption-lean semiparametric inference method extends robustness and generality to projected direct effect estimands that account for mediators, confounders, and moderators.
The proposed doubly robust estimators are consistent and efficient under minimal assumptions, facilitating data-adaptive estimation with machine learning algorithms.
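
As background for the double robustness claim above, the textbook augmented
inverse-probability-weighted (AIPW) construction is consistent if either the
outcome regressions or the propensity model is correctly specified. The sketch
below is that generic construction with our own hypothetical signature, not the
paper's PII estimator.

```python
import numpy as np

def aipw_ate(y, a, e_hat, mu0_hat, mu1_hat):
    """Doubly robust (AIPW) estimate of an average treatment effect.
    y: outcomes, a: binary treatment indicators, e_hat: estimated propensities,
    mu0_hat/mu1_hat: estimated outcome regressions under control/treatment."""
    term1 = mu1_hat + a * (y - mu1_hat) / e_hat
    term0 = mu0_hat + (1 - a) * (y - mu0_hat) / (1 - e_hat)
    return float(np.mean(term1 - term0))
```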
arXiv Detail & Related papers (2024-10-07T12:52:38Z)
- Generalization Analysis of Machine Learning Algorithms via the Worst-Case Data-Generating Probability Measure [1.773764539873123]
A worst-case probability measure over the data is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms.
Fundamental generalization metrics, such as the sensitivity of the expected loss, the sensitivity of empirical risk, and the generalization gap are shown to have closed-form expressions.
A novel parallel is established between the worst-case data-generating probability measure and the Gibbs algorithm.
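
The Gibbs algorithm invoked in that parallel is conventionally an exponentially
tilted reference measure; a standard statement (in our notation, which need not
match the paper's) is:

```latex
% Gibbs posterior over models \theta given data z, with reference prior \pi,
% empirical risk L, and inverse temperature \lambda > 0:
P(\mathrm{d}\theta \mid z) \propto \exp\!\bigl(-\lambda\, L(\theta, z)\bigr)\, \pi(\mathrm{d}\theta)
```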
arXiv Detail & Related papers (2023-12-19T15:20:27Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the unidentifiability region of the model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
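
A standard mechanism for such robustness within the elliptical family is the
Student-t E-step, which downweights points with large Mahalanobis distance. The
sketch below shows that textbook weight for fixed parameters (names are ours;
it omits the paper's missing-data handling).

```python
import numpy as np

def t_em_weights(X, mu, Sigma, nu):
    """E-step weights u_t = (nu + d) / (nu + delta_t) for a multivariate
    Student-t fit: delta_t is the squared Mahalanobis distance, so outlying
    points receive small weights in the subsequent M-step."""
    d = X.shape[1]
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return (nu + d) / (nu + delta)
```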
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- Low-rank statistical finite elements for scalable model-data synthesis [0.8602553195689513]
statFEM acknowledges a priori model misspecification by embedding stochastic forcing within the governing equations.
The method reconstructs the observed data-generating processes with minimal loss of information.
This article overcomes the associated computational bottleneck by embedding a low-rank approximation of the underlying dense covariance matrix.
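
One minimal way to realize such a low-rank embedding is a truncated
eigendecomposition of the covariance; the sketch below is that generic
construction (the article's specific factorization may differ).

```python
import numpy as np

def low_rank_factor(C, rank):
    """Approximate a dense covariance C ~= L @ L.T with L of shape (n, rank),
    so downstream linear algebra costs O(n * rank^2) instead of O(n^3)."""
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:rank]      # keep the leading spectral modes
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
```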
arXiv Detail & Related papers (2021-09-10T09:51:43Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
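
For reference, the $l_{2,p}$ regularizer aggregates row-wise $l_2$ norms with
an $l_p$ norm, driving entire rows (features) of the projection matrix to zero.
A direct computation (a sketch, assuming one row per feature; $p \in (0, 1]$
gives stronger sparsity):

```python
import numpy as np

def l2p_norm(W, p):
    """||W||_{2,p} = (sum_i ||w_i||_2^p)^(1/p): l2 norms of the rows of W,
    aggregated by an lp (quasi-)norm that promotes row-wise sparsity."""
    row_norms = np.linalg.norm(W, axis=1)
    return float((row_norms ** p).sum() ** (1.0 / p))
```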
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Stochastic Approximation for Online Tensorial Independent Component Analysis [98.34292831923335]
Independent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing.
In this paper, we present a by-product online tensorial algorithm that produces an estimate for each independent component.
arXiv Detail & Related papers (2020-12-28T18:52:37Z)
- General stochastic separation theorems with optimal bounds [68.8204255655161]
The phenomenon of separability was revealed and used in machine learning to correct errors of Artificial Intelligence (AI) systems and to analyze AI instabilities.
Errors or clusters of errors can be separated from the rest of the data.
The ability to correct an AI system also opens up the possibility of an attack on it, and the high dimensionality induces vulnerabilities caused by the same separability.
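
The separability phenomenon is easy to check numerically: in high dimension, a
single sample is typically separable from all remaining data by a simple linear
functional. The toy demonstration below (our construction on i.i.d. Gaussian
data, not the theorems' optimal bounds) separates one point with a Fisher-type
threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 200
X = rng.standard_normal((n, d))
x, others = X[0], X[1:]
proj = others @ x / (x @ x)     # project the rest onto the candidate point
# x itself projects to exactly 1.0; in high d almost all others fall below 0.5.
print((proj < 0.5).mean())      # fraction separated by the linear functional
```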
arXiv Detail & Related papers (2020-10-11T13:12:41Z)
- Information Theory Measures via Multidimensional Gaussianization [7.788961560607993]
Information theory is an outstanding framework to measure uncertainty, dependence and relevance in data and systems.
It has several desirable properties for real-world applications.
However, obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality.
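
The Gaussianization idea can be sketched in a single iteration: map each
marginal to a standard normal through its empirical CDF, then rotate; repeating
this drives the data toward a multivariate Gaussian, where entropies and mutual
informations have closed forms. The sketch below is one such RBIG-style
iteration under assumed inputs, not the paper's full estimator.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussianization_step(X):
    """One iteration: marginal Gaussianization (empirical CDF + probit),
    then a PCA rotation to mix the dimensions for the next iteration."""
    n = X.shape[0]
    U = rankdata(X, axis=0) / (n + 1.0)      # empirical CDF values in (0, 1)
    G = norm.ppf(U)                          # marginally standard normal
    _, _, Vt = np.linalg.svd(G - G.mean(axis=0), full_matrices=False)
    return G @ Vt.T                          # rotate into the PCA basis
```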
arXiv Detail & Related papers (2020-10-08T07:22:16Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
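
A minimal version of such an ensemble, using scikit-learn's LDA on independent
random Gaussian projections, is sketched below for binary 0/1 labels; the
paper's consistent estimator of the misclassification probability is not
reproduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rp_lda_predict(X_train, y_train, X_test, proj_dim, n_members, seed=0):
    """Majority vote over LDA classifiers, each fit on an independent random
    Gaussian projection of the data (assumes binary labels in {0, 1})."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X_test.shape[0])
    for _ in range(n_members):
        R = rng.standard_normal((X_train.shape[1], proj_dim)) / np.sqrt(proj_dim)
        clf = LinearDiscriminantAnalysis().fit(X_train @ R, y_train)
        votes += clf.predict(X_test @ R)
    return (votes / n_members > 0.5).astype(int)
```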
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.