Confidence Intervals for Evaluation of Data Mining
- URL: http://arxiv.org/abs/2502.07016v1
- Date: Mon, 10 Feb 2025 20:22:02 GMT
- Title: Confidence Intervals for Evaluation of Data Mining
- Authors: Zheng Yuan, Wenxin Jiang
- Abstract summary: We consider statistical inference about general performance measures used in data mining.
We study the finite sample coverage probabilities for confidence intervals.
We also propose a `blurring correction' on the variance to improve the finite sample performance.
- Score: 3.8485822412233452
- Abstract: In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used across a vast literature for evaluation and comparison. Examples include classification accuracy, precision, recall, F measures, and the Jaccard index. Typically, these performance measures are only approximately estimated from a finite dataset, which may lead to findings that are not statistically significant. To properly quantify such statistical uncertainty, it is important to provide confidence intervals associated with these estimated performance measures. We consider statistical inference about general performance measures used in data mining, with both individual and joint confidence intervals. These confidence intervals are based on asymptotic normal approximations and can be computed fast, without the need for bootstrap resampling. We study the finite sample coverage probabilities for these confidence intervals and also propose a `blurring correction' on the variance to improve the finite sample performance. This `blurring correction' generalizes the plus-four method from the binomial proportion to general performance measures used in data mining. Our framework allows multiple performance measures of multiple classification rules to be inferred simultaneously for comparison.
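To make the flavor of these corrections concrete, below is a minimal Python sketch of the classical plus-four (Agresti-Coull) interval for classification accuracy, the binomial special case that the paper's `blurring correction' generalizes, together with a simple Bonferroni route to joint intervals over several measures. The function names and the Bonferroni adjustment are illustrative choices; the paper's actual variance correction for general measures such as F1 is not reproduced here.

```python
import numpy as np
from scipy.stats import norm

def plus_four_interval(n_correct, n_total, alpha=0.05):
    """Plus-four CI for a proportion such as classification accuracy:
    add 2 successes and 2 failures, then form a Wald interval around
    the adjusted estimate."""
    z = norm.ppf(1 - alpha / 2)
    p_tilde = (n_correct + 2) / (n_total + 4)
    se = np.sqrt(p_tilde * (1 - p_tilde) / (n_total + 4))
    return p_tilde - z * se, p_tilde + z * se

def joint_wald_intervals(estimates, std_errors, alpha=0.05):
    """Simultaneous intervals for k estimated performance measures via
    a Bonferroni adjustment, one simple route to joint coverage."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    z = norm.ppf(1 - alpha / (2 * len(estimates)))
    return estimates - z * std_errors, estimates + z * std_errors

# Example: accuracy of 85/100 on a held-out test set.
print(plus_four_interval(85, 100))  # roughly (0.77, 0.91)
```

The plus-four adjustment pulls the point estimate toward 1/2, which is what repairs the poor small-sample coverage of the plain Wald interval near 0 or 1; the paper's blurring correction plays an analogous role for measures that are ratios of several estimated counts.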
Related papers
- Relational Conformal Prediction for Correlated Time Series [56.59852921638328]
We propose a novel distribution-free approach based on the conformal prediction framework and quantile regression.
We fill this void by introducing a novel conformal prediction method based on graph deep learning operators.
Our approach provides accurate coverage and achieves state-of-the-art uncertainty quantification on relevant benchmarks (a generic split-conformal sketch appears after this list).
arXiv Detail & Related papers (2025-02-13T16:12:17Z)
- Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning [53.42244686183879]
Conformal prediction provides model-agnostic and distribution-free uncertainty quantification.
Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data.
We propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning.
arXiv Detail & Related papers (2024-10-13T15:37:11Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
- Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations [16.934078380644216]
We review methods for constructing confidence intervals for error rates in 1:1 matching tasks.
We show how coverage and interval width vary with sample size, error rates, and degree of data dependence.
arXiv Detail & Related papers (2023-06-01T23:23:37Z)
- Statistical Inference with Stochastic Gradient Methods under $\phi$-mixing Data [9.77185962310918]
We propose a mini-batch SGD estimator for statistical inference when the data is $\phi$-mixing.
The confidence intervals are constructed using an associated mini-batch SGD procedure.
The proposed method is memory-efficient and easy to implement in practice.
arXiv Detail & Related papers (2023-02-24T16:16:43Z)
- UQ-ARMED: Uncertainty quantification of adversarially-regularized mixed effects deep learning for clustered non-iid data [0.6719751155411076]
This work demonstrates the ability to produce readily interpretable statistical metrics for model fit, fixed effects covariance coefficients, and prediction confidence.
In our experiment on AD prognosis, not only do the UQ methods provide these benefits, but several also maintain the high performance of the original ARMED method.
arXiv Detail & Related papers (2022-11-29T02:50:48Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Statistical Evaluation of Anomaly Detectors for Sequences [0.0]
We formalize a notion of precision and recall with temporal tolerance for point-based anomaly detection in sequential data.
We show how to obtain null distributions for the two measures to assess the statistical significance of reported results.
arXiv Detail & Related papers (2020-08-13T10:07:27Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
- Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators of the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
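Several of the related papers above build on conformal prediction. As referenced in the first entry, here is a minimal sketch of generic split-conformal prediction intervals for regression, assuming exchangeable calibration and test data. It illustrates only the base framework, not the relational/graph or poisoning-robust variants those papers propose; all names here are illustrative.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    """Distribution-free prediction bands with >= 1 - alpha marginal
    coverage, assuming exchangeable calibration and test points."""
    # Nonconformity scores: absolute residuals on the calibration split.
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    preds = predict(X_test)
    return preds - q, preds + q

# Toy usage with a fixed "model" that predicts the feature mean.
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(200, 3))
y_cal = X_cal.mean(axis=1) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(5, 3))
predict = lambda X: X.mean(axis=1)
lower, upper = split_conformal_interval(predict, X_cal, y_cal, X_test)
```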