Related papers: A step towards the integration of machine learning and small area estimation

A step towards the integration of machine learning and small area estimation

URL: http://arxiv.org/abs/2402.07521v1
Date: Mon, 12 Feb 2024 09:43:17 GMT
Title: A step towards the integration of machine learning and small area estimation
Authors: Tomasz \.Z\k{a}d{\l}o, Adam Chwila
Abstract summary: We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristics. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well. What is more, we propose the method of the accuracy estimation of machine learning predictors, giving the possibility of the accuracy comparison with classic methods.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The use of machine-learning techniques has grown in numerous research areas. Currently, it is also widely used in statistics, including the official statistics for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) but also for data analysis. However, the usage of these methods in survey sampling including small area estimation is still very limited. Therefore, we propose a predictor supported by these algorithms which can be used to predict any population or subpopulation characteristics based on cross-sectional and longitudinal data. Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between the variables, which means that they have very good properties in case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up, in our opinion of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with optimal methods under the model. What is more, we propose the method of the accuracy estimation of machine learning predictors, giving the possibility of the accuracy comparison with classic methods, where the accuracy is measured as in survey sampling practice. The solution of this problem is indicated in the literature as one of the key issues in integration of these approaches. The simulation studies are based on a real, longitudinal dataset, freely available from the Polish Local Data Bank, where the prediction problem of subpopulation characteristics in the last period, with "borrowing strength" from other subpopulations and time periods, is considered.

Related papers

Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions.
arXiv Detail & Related papers (2025-02-10T21:59:27Z)
Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling [20.078602767179355]
Failure to properly account for errors in machine learning predictions renders standard statistical procedures invalid. We introduce bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
arXiv Detail & Related papers (2025-01-30T18:46:43Z)
Boosting Test Performance with Importance Sampling--a Subpopulation Perspective [16.678910111353307]
In this paper, we identify important sampling as a simple yet powerful tool for solving the subpopulation problem. We provide a new systematic formulation of the subpopulation problem and explicitly identify the assumptions that are not clearly stated in the existing works. On the application side, we demonstrate a single estimator is enough to solve the subpopulation problem.
arXiv Detail & Related papers (2024-12-17T15:25:24Z)
Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL) It exploits the dependency between a set of predictors with continuous predictive scores, rank the predictors without labeled data and combine them to an ensembled score with weights. The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z)
Machine Learning for Predicting Chaotic Systems [0.0]
Predicting chaotic dynamical systems is critical in many scientific fields, such as weather forecasting.<n>In this paper, we compare different lightweight and heavyweight machine learning architectures.<n>We introduce the cumulative maximum error, a novel metric that combines desirable properties of traditional metrics and is tailored for chaotic systems.
arXiv Detail & Related papers (2024-07-29T16:34:47Z)
Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs) Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU) Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
How to Determine the Most Powerful Pre-trained Language Model without Brute Force Fine-tuning? An Empirical Survey [23.757740341834126]
We show that H-Score generally performs well with superiorities in effectiveness and efficiency. We also outline the difficulties of consideration of training details, applicability to text generation, and consistency to certain metrics which shed light on future directions.
arXiv Detail & Related papers (2023-12-08T01:17:28Z)
Multi-dimensional domain generalization with low-rank structures [18.565189720128856]
In statistical and machine learning methods, it is typically assumed that the test data are identically distributed with the training data. This assumption does not always hold, especially in applications where the target population are not well-represented in the training data. We present a novel approach to addressing this challenge in linear regression models.
arXiv Detail & Related papers (2023-09-18T08:07:58Z)
A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem. We show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties.
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
Mixed moving average field guided learning for spatio-temporal data [0.0]
We define a novel Bayesian-temporal embedding and a theory-guided machine learning approach to make ensemble forecasts. We use Lipschitz predictors to determine fixed-time and any-time PAC in the batch learning setting. We then test the performance of our learning methodology by using linear predictors and data sets simulated from a dependence- Ornstein-Uhlenbeck process.
arXiv Detail & Related papers (2023-01-02T16:11:05Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
The Lifecycle of a Statistical Model: Model Failure Detection, Identification, and Refitting [26.351782287953267]
We develop tools and theory for detecting and identifying regions of the covariate space (subpopulations) where model performance has begun to degrade. We present empirical results with three real-world data sets. We complement these empirical results with theory proving that our methodology is minimax optimal for recovering anomalous subpopulations.
arXiv Detail & Related papers (2022-02-08T22:02:31Z)
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
Predictive machine learning for prescriptive applications: a coupled training-validating approach [77.34726150561087]
We propose a new method for training predictive machine learning models for prescriptive applications. This approach is based on tweaking the validation step in the standard training-validating-testing scheme. Several experiments with synthetic data demonstrate promising results in reducing the prescription costs in both deterministic and real models.
arXiv Detail & Related papers (2021-10-22T15:03:20Z)
ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data. The idea is to estimate the metrics of interest for a model-under-test using Bayesian neural network (BNN)
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
Detangling robustness in high dimensions: composite versus model-averaged estimation [11.658462692891355]
Robust methods, though ubiquitous in practice, are yet to be fully understood in the context of regularized estimation and high dimensions. This paper provides a toolbox to further study robustness in these settings and focuses on prediction.
arXiv Detail & Related papers (2020-06-12T20:40:15Z)
Learning the Truth From Only One Side of the Story [58.65439277460011]
We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution. We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically.
arXiv Detail & Related papers (2020-06-08T18:20:28Z)
Marginal likelihood computation for model selection and hypothesis testing: an extensive review [66.37504201165159]
This article provides a comprehensive study of the state-of-the-art of the topic. We highlight limitations, benefits, connections and differences among the different techniques. Problems and possible solutions with the use of improper priors are also described.
arXiv Detail & Related papers (2020-05-17T18:31:58Z)
Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties. We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE) When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
Design-unbiased statistical learning in survey sampling [0.0]
We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning.
arXiv Detail & Related papers (2020-03-25T14:27:39Z)
Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models. The weights are determined by a data-driven way and depend on the similarity between the corresponding functions. We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.