An empirical comparison of some outlier detection methods with longitudinal data
- URL: http://arxiv.org/abs/2507.21203v1
- Date: Mon, 28 Jul 2025 16:06:15 GMT
- Title: An empirical comparison of some outlier detection methods with longitudinal data
- Authors: Marcello D'Orazio,
- Abstract summary: It compares well-known methods used in official statistics with proposals from the fields of data mining and machine learning.<n>This is achieved by applying the methods to panel survey data related to different types of statistical units.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This note investigates the problem of detecting outliers in longitudinal data. It compares well-known methods used in official statistics with proposals from the fields of data mining and machine learning that are based on the distance between observations or binary partitioning trees. This is achieved by applying the methods to panel survey data related to different types of statistical units. Traditional methods are quite simple, enabling the direct identification of potential outliers, but they require specific assumptions. In contrast, recent methods provide only a score whose magnitude is directly related to the likelihood of an outlier being present. All the methods require the user to set a number of tuning parameters. However, the most recent methods are more flexible and sometimes more effective than traditional methods. In addition, these methods can be applied to multidimensional data.
Related papers
- Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls [65.44462297594308]
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data.<n>Most unsupervised outlier detection methods are carefully designed to detect specified outliers.<n>We propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers.
arXiv Detail & Related papers (2025-01-06T12:35:51Z) - Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - A Bayesian approach to uncertainty in word embedding bias estimation [0.0]
Multiple measures, such as WEAT or MAC, attempt to quantify the magnitude of bias present in word embeddings in terms of a single-number metric.
We show that similar results can be easily obtained using such methods even if the data are generated by a null model lacking the intended bias.
We propose a Bayesian alternative: hierarchical Bayesian modeling, which enables a more uncertainty-sensitive inspection of bias in word embeddings.
arXiv Detail & Related papers (2023-06-15T11:48:50Z) - Auditing Fairness by Betting [43.515287900510934]
We provide practical, efficient, and nonparametric methods for auditing the fairness of deployed classification and regression models.<n>Our methods are sequential and allow for the continuous monitoring of incoming data.<n>We demonstrate the efficacy of our approach on three benchmark fairness datasets.
arXiv Detail & Related papers (2023-05-27T20:14:11Z) - A Coreset Learning Reality Check [33.002265576337486]
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets.
In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification.
We compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness.
arXiv Detail & Related papers (2023-01-15T19:26:17Z) - Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3% over four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z) - Approximate Bayesian Computation with Domain Expert in the Loop [13.801835670003008]
We introduce an active learning method for ABC statistics selection which reduces the domain expert's work considerably.
By involving the experts, we are able to handle misspecified models, unlike the existing dimension reduction methods.
empirical results show better posterior estimates than with existing methods, when the simulation budget is limited.
arXiv Detail & Related papers (2022-01-28T12:58:51Z) - Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic
Approach to Manifold Dimension Estimation [92.81218653234669]
We present new approach to manifold hypothesis checking and underlying manifold dimension estimation.
Our geometrical method is a modification for sparse data of a well-known box-counting algorithm for Minkowski dimension calculation.
Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
arXiv Detail & Related papers (2021-07-08T15:35:54Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z) - DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set.
Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z) - Scalable Marginal Likelihood Estimation for Model Selection in Deep
Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z) - Overcoming the curse of dimensionality with Laplacian regularization in
semi-supervised learning [80.20302993614594]
We provide a statistical analysis to overcome drawbacks of Laplacian regularization.
We unveil a large body of spectral filtering methods that exhibit desirable behaviors.
We provide realistic computational guidelines in order to make our method usable with large amounts of data.
arXiv Detail & Related papers (2020-09-09T14:28:54Z) - On Contrastive Learning for Likelihood-free Inference [20.49671736540948]
Likelihood-free methods perform parameter inference in simulator models where evaluating the likelihood is intractable.
One class of methods for this likelihood-free problem uses a classifier to distinguish between pairs of parameter-observation samples.
Another popular class of methods fits a conditional distribution to the parameter posterior directly, and a particular recent variant allows for the use of flexible neural density estimators.
arXiv Detail & Related papers (2020-02-10T13:14:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.