A parametric distribution for exact post-selection inference with data carving
- URL: http://arxiv.org/abs/2305.12581v1
- Date: Sun, 21 May 2023 22:29:55 GMT
- Title: A parametric distribution for exact post-selection inference with data carving
- Authors: Erik Drysdale
- Abstract summary: Post-selection inference (PoSI) is a technique for obtaining valid confidence intervals and p-values when hypothesis generation and testing use the same source of data.
Data carving is a variant of PoSI in which a portion of held out data is combined with the hypothesis generating data at inference time.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-selection inference (PoSI) is a statistical technique for obtaining
valid confidence intervals and p-values when hypothesis generation and testing
use the same source of data. PoSI can be used on a range of popular algorithms
including the Lasso. Data carving is a variant of PoSI in which a portion of
held out data is combined with the hypothesis generating data at inference
time. While data carving has attractive theoretical and empirical properties,
existing approaches rely on computationally expensive MCMC methods to carry out
inference. This paper's key contribution is to show that pivotal quantities can
be constructed for the data carving procedure based on a known parametric
distribution. Specifically, when the selection event is characterized by a set
of polyhedral constraints on a Gaussian response, data carving will follow the
sum of a normal and a truncated normal (SNTN), which is a variant of the
truncated bivariate normal distribution. The main impact of this insight is
that obtaining exact inference for data carving can be made computationally
trivial, since the CDF of the SNTN distribution can be found using the CDF of a
standard bivariate normal. A Python package, sntn, has been released to further
facilitate the adoption of data carving with PoSI.
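To make the computational claim concrete, the reduction from the SNTN CDF to a standard bivariate normal CDF can be sketched in a few lines. The sketch below uses SciPy directly rather than the released sntn package, and its parametrization (mu1, tau1 for the Gaussian summand; mu2, tau2, a, b for the truncated summand) is illustrative rather than the package's actual API:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def sntn_cdf(w, mu1, tau1, mu2, tau2, a, b):
    """CDF at w of W = X + Y, where X ~ N(mu1, tau1^2) and
    Y ~ TN(mu2, tau2^2; a, b) are independent."""
    s = np.sqrt(tau1**2 + tau2**2)               # sd of the untruncated sum
    rho = tau2 / s                               # corr(X + Y0, Y0) before truncation
    zw = (w - (mu1 + mu2)) / s                   # standardized evaluation point
    za, zb = (a - mu2) / tau2, (b - mu2) / tau2  # standardized truncation limits
    cov = [[1.0, rho], [rho, 1.0]]
    bvn = lambda x, y: multivariate_normal.cdf([x, y], mean=[0, 0], cov=cov)
    # P(W <= w) = P(X + Y0 <= w | a <= Y0 <= b), rewritten via the BVN CDF
    return (bvn(zw, zb) - bvn(zw, za)) / (norm.cdf(zb) - norm.cdf(za))
```

With a closed-form CDF like this, a one-sided post-selection p-value is a single evaluation (e.g. 1 - sntn_cdf(w_obs, ...)), which is the sense in which inference becomes computationally trivial relative to MCMC.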
Related papers
- Bayesian Estimation and Tuning-Free Rank Detection for Probability Mass Function Tensors [17.640500920466984]
This paper presents a novel framework for estimating the joint PMF and automatically inferring its rank from observed data.
We derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of various model parameters. Additionally, we develop a scalable version of the VI-based approach by leveraging stochastic variational inference (SVI).
Experiments involving both synthetic data and real movie recommendation data illustrate the advantages of our VI and SVI-based methods in terms of estimation accuracy, automatic rank detection, and computational efficiency.
arXiv Detail & Related papers (2024-10-08T20:07:49Z) - Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms [19.616162116973637]
We develop a unified methodology for statistical inference via randomized sketching or projections.
The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to the randomized algorithm.
arXiv Detail & Related papers (2024-04-01T04:35:44Z) - Bayesian Renormalization [68.8204255655161]
We present a fully information-theoretic approach to renormalization inspired by Bayesian statistical inference.
The main insight of Bayesian Renormalization is that the Fisher metric defines a correlation length that plays the role of an emergent RG scale.
We provide insight into how the Bayesian Renormalization scheme relates to existing methods for data compression and data generation.
arXiv Detail & Related papers (2023-05-17T18:00:28Z) - On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
arXiv Detail & Related papers (2023-02-21T14:14:40Z) - Data thinning for convolution-closed distributions [2.299914829977005]
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation.
We show that data thinning can be used to validate the results of unsupervised learning approaches (a minimal Poisson-case sketch appears after this list).
arXiv Detail & Related papers (2023-01-18T02:47:41Z) - FaDIn: Fast Discretized Inference for Hawkes Processes with General
Parametric Kernels [82.53569355337586]
This work offers an efficient solution to temporal point processes inference using general parametric kernels with finite support.
The method's effectiveness is evaluated by modeling the occurrence of stimuli-induced patterns from brain signals recorded with magnetoencephalography (MEG).
Results show that the proposed approach estimates pattern latency more accurately than the state of the art.
arXiv Detail & Related papers (2022-10-10T12:35:02Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - ECOD: Unsupervised Outlier Detection Using Empirical Cumulative
Distribution Functions [12.798256312657136]
Outlier detection refers to the identification of data points that deviate from a general data distribution.
We present ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution (a simplified sketch of the scoring rule appears after this list).
arXiv Detail & Related papers (2022-01-02T17:28:35Z) - Optimal regularizations for data generation with probabilistic graphical
models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models.
We consider the particular case of L2 and L1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
arXiv Detail & Related papers (2021-12-02T14:45:16Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)