To BEE or not to BEE: Estimating more than Entropy with Biased Entropy Estimators
- URL: http://arxiv.org/abs/2501.11395v1
- Date: Mon, 20 Jan 2025 10:48:08 GMT
- Authors: Ilaria Pia la Torre, David A. Kelly, Hector D. Menendez, David Clark
- Abstract summary: We apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer.
We investigate how the estimators are affected by two main influential factors: sample size and domain size.
Our most important result is identifying that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth.
- Abstract: Entropy estimation plays a significant role in biology, economics, physics, communication engineering and other disciplines. It is increasingly used in software engineering, e.g. in software confidentiality, software testing, predictive analysis, machine learning, and software improvement. However, accurate estimation is demonstrably expensive in many contexts, including software. Statisticians have consequently developed biased estimators that aim to accurately estimate entropy on the basis of a sample. In this paper we apply 18 widely employed entropy estimators to Shannon measures useful to the software engineer: entropy, mutual information and conditional mutual information. Moreover, we investigate how the estimators are affected by two main influential factors: sample size and domain size. Our experiments range over a large set of randomly generated joint probability distributions and varying sample sizes, rather than choosing just one or two well-known probability distributions as in previous investigations. Our most important result is identifying that the Chao-Shen and Chao-Wang-Jost estimators stand out for consistently converging more quickly to the ground truth, regardless of domain size and regardless of the measure used. They also tend to outperform the others in terms of accuracy as sample sizes increase. This discovery enables a significant reduction in data collection effort without compromising performance.
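To make the abstract's terms concrete, the following is a minimal Python sketch, not the authors' implementation, of two sample-based entropy estimators: the naive plug-in (maximum-likelihood) estimator and the Chao-Shen estimator in its commonly cited coverage-adjusted form. The function names, the use of natural logarithms, and the guard for all-singleton samples are assumptions made for illustration. Mutual information is obtained from the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which is one standard way to lift an entropy estimator to that measure.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def chao_shen_entropy(samples):
    """Chao-Shen coverage-adjusted entropy estimate, in nats.

    Sketch of the commonly cited form: shrink the observed frequencies by
    the estimated sample coverage, then weight each term by the inverse of
    the probability that the symbol appears in a sample of size n.
    """
    n = len(samples)
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:
        # Assumed guard: avoid zero coverage when every symbol is a singleton.
        singletons = n - 1
    coverage = 1.0 - singletons / n
    h = 0.0
    for c in counts.values():
        p = coverage * c / n                # coverage-adjusted probability
        inclusion = 1.0 - (1.0 - p) ** n    # chance the symbol is observed
        h -= p * math.log(p) / inclusion
    return h


def mutual_information(pairs, estimator):
    """Estimate I(X;Y) = H(X) + H(Y) - H(X,Y) from (x, y) samples."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return estimator(xs) + estimator(ys) - estimator(pairs)
```

For example, `mutual_information(pairs, chao_shen_entropy)` applies the Chao-Shen correction to each of the three entropy terms separately; whether the paper composes its estimators in exactly this way is not stated in the abstract.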
Related papers
- confidence-planner: Easy-to-Use Prediction Confidence Estimation and Sample Size Planning [3.0969191504482247]
We present an easy-to-use python package and web application for estimating prediction confidence intervals.
The package offers eight different procedures to determine and justify the sample size and confidence of predictions.
arXiv Detail & Related papers (2023-01-12T14:49:59Z) - ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference [54.17205151960878]
We introduce a sampling-free approach that is generic and easy to deploy.
We produce reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost.
arXiv Detail & Related papers (2022-11-21T13:23:09Z) - Estimating the Entropy of Linguistic Distributions [75.20045001387685]
We study the empirical effectiveness of different entropy estimators for linguistic distributions.
We find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators.
arXiv Detail & Related papers (2022-04-04T13:36:46Z) - On Variance Estimation of Random Forests [0.0]
This paper develops an unbiased variance estimator based on incomplete U-statistics.
We show that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs.
arXiv Detail & Related papers (2022-02-18T03:35:47Z) - Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z) - Neural Joint Entropy Estimation [12.77733789371855]
Estimating the entropy of a discrete random variable is a fundamental problem in information theory and related fields.
In this work, we introduce a practical solution to this problem, which extends the work of McAllester and Stratos (2020).
The proposed scheme uses the generalization abilities of cross-entropy estimation in deep neural networks (DNNs) to introduce improved entropy estimation accuracy.
arXiv Detail & Related papers (2020-12-21T09:23:39Z) - MASSIVE: Tractable and Robust Bayesian Learning of Many-Dimensional Instrumental Variable Models [8.271859911016719]
We propose a general and efficient causal inference algorithm that accounts for model uncertainty.
We show that, as long as some of the candidates are (close to) valid, without knowing a priori which ones, they collectively still pose enough restrictions on the target interaction to obtain a reliable causal effect estimate.
arXiv Detail & Related papers (2020-12-18T10:06:55Z) - A Robust Test for Elliptical Symmetry [2.030567625639093]
Ellipticity GoF tests are usually hard to analyze and often their statistical power is not particularly strong.
We develop a novel framework based on the exchangeable random variables calculus introduced by de Finetti.
arXiv Detail & Related papers (2020-06-05T08:51:16Z) - Showing Your Work Doesn't Always Work [73.63200097493576]
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
arXiv Detail & Related papers (2020-04-28T17:59:01Z) - Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)