Two-level histograms for dealing with outliers and heavy tail distributions
- URL: http://arxiv.org/abs/2306.05786v1
- Date: Fri, 9 Jun 2023 09:57:18 GMT
- Title: Two-level histograms for dealing with outliers and heavy tail distributions
- Authors: Marc Boullé
- Abstract summary: We focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter.
We investigate the limits of this method in the case of outliers or heavy-tailed distributions and suggest a two-level heuristic to deal with such cases.
The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values.
The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Histograms are among the most popular methods used in exploratory analysis to
summarize univariate distributions. In particular, irregular histograms are
good non-parametric density estimators that require very few parameters: the
number of bins with their lengths and frequencies. Many approaches have been
proposed in the literature to infer these parameters, either assuming
hypotheses about the underlying data distributions or exploiting a model
selection approach. In this paper, we focus on the G-Enum histogram method,
which exploits the Minimum Description Length (MDL) principle to build
histograms without any user parameter and achieves state-of-the-art performance
w.r.t. accuracy, parsimony, and computation time. We investigate the limits of
this method in the case of outliers or heavy-tailed distributions. We suggest a
two-level heuristic to deal with such cases. The first level exploits a
logarithmic transformation of the data to split the data set into a list of
data subsets with a controlled range of values. The second level builds a
sub-histogram for each data subset and aggregates them to obtain a complete
histogram. Extensive experiments show the benefits of the approach.
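As an illustration of the two-level heuristic described above, here is a minimal Python sketch. The magnitude-based split rule and numpy's automatic binning are stand-ins chosen here for brevity; the actual method relies on the MDL-driven G-Enum criterion, so this is a rough approximation rather than the authors' implementation.

```python
import numpy as np

def two_level_histogram(data, base=10):
    """Two-level sketch: split values by order of magnitude (first level),
    then build one sub-histogram per subset and collect the results
    (second level). np.histogram's 'auto' binning stands in for the
    MDL-based G-Enum criterion; the magnitude split is illustrative."""
    data = np.asarray(data, dtype=float)
    # First level: group by floor(log_base(|x|)); exact zeros form their own group.
    safe = np.where(data == 0.0, 1.0, np.abs(data))
    magnitude = np.where(data == 0.0, -np.inf, np.floor(np.log(safe) / np.log(base)))
    sub_histograms = []
    for m in np.unique(magnitude):
        subset = data[magnitude == m]
        # Second level: an ordinary histogram on a subset with a controlled value range.
        counts, edges = np.histogram(subset, bins="auto")
        sub_histograms.append((edges, counts))
    return sub_histograms

# Heavy-tailed example: a single global histogram would waste most bins on the tail.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=2.5, size=10_000)
print(len(two_level_histogram(sample)), "sub-histograms built")
```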
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
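A rough illustration of that idea, using the third-party minisom package: the SOM is trained on unlabeled data only, the few labeled points attach target values to their best matching units, and a nearest-labeled-BMU rule (an assumption made here for brevity, not the paper's procedure) produces predictions.

```python
import numpy as np
from minisom import MiniSom  # third-party package: pip install minisom

rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(500, 4))      # plenty of unlabeled samples
labeled_x = rng.normal(size=(10, 4))       # a minimal number of labeled samples
labeled_y = rng.normal(size=10)            # regression targets

# Train the SOM on unlabeled data only.
som = MiniSom(8, 8, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(unlabeled, num_iteration=1000)

# Attach each labeled point's target to its best matching unit (BMU).
bmu_targets = {som.winner(x): float(y) for x, y in zip(labeled_x, labeled_y)}

def predict(x):
    """Predict with the target of the nearest labeled BMU on the map grid."""
    i, j = som.winner(x)
    nearest = min(bmu_targets, key=lambda b: (b[0] - i) ** 2 + (b[1] - j) ** 2)
    return bmu_targets[nearest]

print(predict(rng.normal(size=4)))
```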
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Fast and fully-automated histograms for large-scale data sets [0.0]
G-Enum histograms are a new fast and fully automated method for irregular histogram construction.
They leverage the Minimum Description Length principle to derive two different model selection criteria.
arXiv Detail & Related papers (2022-12-27T15:37:10Z)
- Multiclass histogram-based thresholding using kernel density estimation and scale-space representations [0.0]
We present a new method for multiclass thresholding of a histogram based on nonparametric Kernel Density (KD) estimation.
The method compares the number of extracted minima of the KD estimate with the number of requested clusters minus one.
We verify the method using synthetic histograms with known threshold values and using the histogram of real X-ray computed tomography images.
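To make that comparison step concrete, here is a small sketch with SciPy: estimate the density with a Gaussian KDE, extract its local minima as candidate thresholds, and scan a bandwidth grid (the grid is an assumption made here for illustration; the paper uses scale-space representations) until the number of minima equals the requested number of clusters minus one.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmin

def kde_thresholds(data, n_clusters, bandwidths=np.linspace(0.05, 1.0, 20)):
    grid = np.linspace(np.min(data), np.max(data), 1000)
    for bw in bandwidths[::-1]:                  # start smooth, then sharpen
        density = gaussian_kde(data, bw_method=bw)(grid)
        minima = grid[argrelmin(density)[0]]     # candidate thresholds
        if len(minima) == n_clusters - 1:
            return minima
    return None                                  # no bandwidth matched

# Example: three well-separated modes should yield two thresholds.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300), rng.normal(12, 1, 300)])
print(kde_thresholds(data, n_clusters=3))
```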
arXiv Detail & Related papers (2022-02-10T01:03:43Z)
- ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions [12.798256312657136]
Outlier detection refers to the identification of data points that deviate from a general data distribution.
We present ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution.
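The tail intuition can be sketched in a few lines of NumPy: score every point by its aggregated tail improbability under each feature's empirical CDF, from both sides. This is a condensed reading of ECOD for illustration, not the reference implementation.

```python
import numpy as np

def ecod_scores(X):
    """Simplified ECOD-style scores: larger means more outlying."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    left = np.zeros(n)
    right = np.zeros(n)
    for j in range(d):
        ranks = np.argsort(np.argsort(X[:, j])) + 1       # ranks 1..n
        ecdf = ranks / n                                   # P(x <= X_ij)
        left += -np.log(ecdf)                              # left-tail surprise
        right += -np.log(1.0 - ecdf + 1.0 / n)             # right-tail surprise
    return np.maximum(left, right)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, -9.0, 7.0]]])  # one gross outlier
print(np.argmax(ecod_scores(X)))  # expected to flag the injected outlier at index 200
```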
arXiv Detail & Related papers (2022-01-02T17:28:35Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even when the missing data ratio increases.
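A minimal sketch of the first strategy, with scikit-learn's KNNImputer standing in for the paper's k-nearest-neighbors imputation (the exact imputer and the EM-based second strategy are not reproduced here):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # e.g. 100 time samples, 8 channels
mask = rng.random(X.shape) < 0.1              # 10% of the values go missing
X_missing = np.where(mask, np.nan, X)

# Impute with k-nearest neighbors, then estimate the covariance from completed data.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
cov = np.cov(X_imputed, rowvar=False)         # channel-by-channel covariance
print(cov.shape)                              # (8, 8)
```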
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- The Earth Mover's Pinball Loss: Quantiles for Histogram-Valued Regression [0.0]
We present a dedicated method for Deep Learning-based histogram regression, which incorporates cross-bin information and yields distributions over possible histograms.
We validate our method with an illustrative toy example, a football-related task, and an astrophysical computer vision problem.
arXiv Detail & Related papers (2021-06-03T18:00:04Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
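For context, the exact leverage scores of a tall, full-column-rank matrix are the squared row norms of an orthonormal basis of its column space; the short sketch below computes them with a thin QR factorization. The paper's randomized, rank-revealing estimators are what make this cheap at scale and are not reproduced here.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of a tall, full-column-rank matrix A."""
    Q, _ = np.linalg.qr(A, mode="reduced")    # orthonormal basis of range(A)
    return np.sum(Q ** 2, axis=1)             # ell_i = ||Q[i, :]||^2

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
scores = leverage_scores(A)
print(scores.sum())                           # sums to the rank of A (20 here)
```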
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
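One plausible reading of the per-sample weighting, sketched on plain linear regression (a stand-in for a deep model; this is not the official ABSGD implementation and omits momentum): weight each sample in the mini-batch by a softmax over its temperature-scaled loss, then take a weighted gradient step.

```python
import numpy as np

def weighted_sgd_step(w, X, y, lr=0.1, temperature=1.0):
    """One importance-weighted step: higher-loss samples get higher weight."""
    residual = X @ w - y
    losses = 0.5 * residual ** 2                 # per-sample squared loss
    weights = np.exp(losses / temperature)
    weights /= weights.sum()                     # softmax over the mini-batch
    grad = X.T @ (weights * residual)            # importance-weighted gradient
    return w - lr * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
w = np.zeros(5)
for _ in range(100):
    w = weighted_sgd_step(w, X, y, lr=0.05)
print(w)
```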
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Block-Approximated Exponential Random Graphs [77.4792558024487]
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs.
We propose an approximative framework for such non-trivial ERGs that results in dyadic independence (i.e., edge-independent) distributions.
Our methods are scalable to sparse graphs consisting of millions of nodes.
arXiv Detail & Related papers (2020-02-14T11:42:16Z)
- Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms [19.134568072720956]
We present tree-based and list-based density estimation methods for binary/categorical data.
Our density estimation models are higher-dimensional analogues of variable-bin-width histograms.
We present an application to crime analysis, where we estimate how unusual each type of modus operandi is for a house break-in.
arXiv Detail & Related papers (2015-10-22T22:29:17Z)