A step towards the integration of machine learning and small area
estimation
- URL: http://arxiv.org/abs/2402.07521v1
- Date: Mon, 12 Feb 2024 09:43:17 GMT
- Authors: Tomasz Żądło, Adam Chwila
- Abstract summary: We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristics.
We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well.
What is more, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The use of machine-learning techniques has grown in numerous research areas.
Currently, these techniques are also widely used in statistics, including official
statistics, for data collection (e.g. satellite imagery, web scraping and text
mining, data cleaning, integration and imputation) but also for data analysis.
However, the usage of these methods in survey sampling including small area
estimation is still very limited. Therefore, we propose a predictor supported
by these algorithms which can be used to predict any population or
subpopulation characteristics based on cross-sectional and longitudinal data.
Machine learning methods have already been shown to be very powerful in
identifying and modelling complex and nonlinear relationships between the
variables, which means that they have very good properties in case of strong
departures from the classic assumptions. Therefore, we analyse the performance
of our proposal under a different set-up, in our opinion of greater importance
in real-life surveys. We study only small departures from the assumed model, to
show that our proposal is a good alternative in this case as well, even in
comparison with optimal methods under the model. What is more, we propose the
method of the accuracy estimation of machine learning predictors, giving the
possibility of the accuracy comparison with classic methods, where the accuracy
is measured as in survey sampling practice. The solution of this problem is
indicated in the literature as one of the key issues in integration of these
approaches. The simulation studies are based on a real, longitudinal dataset,
freely available from the Polish Local Data Bank, where the prediction problem
of subpopulation characteristics in the last period, with "borrowing strength"
from other subpopulations and time periods, is considered.
Related papers
- Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions.
As a testbed, we use country-level results from two global cultural surveys.
We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions.
arXiv Detail & Related papers (2025-02-10T21:59:27Z)
- Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling [20.078602767179355]
Failure to properly account for errors in machine learning predictions renders standard statistical procedures invalid.
We introduce bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed.
We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
arXiv Detail & Related papers (2025-01-30T18:46:43Z)
- Boosting Test Performance with Importance Sampling--a Subpopulation Perspective [16.678910111353307]
In this paper, we identify importance sampling as a simple yet powerful tool for solving the subpopulation problem.
We provide a new systematic formulation of the subpopulation problem and explicitly identify the assumptions that are not clearly stated in the existing works.
On the application side, we demonstrate a single estimator is enough to solve the subpopulation problem.
arXiv Detail & Related papers (2024-12-17T15:25:24Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Multi-dimensional domain generalization with low-rank structures [18.565189720128856]
In statistical and machine learning methods, it is typically assumed that the test data are identically distributed with the training data.
This assumption does not always hold, especially in applications where the target population is not well represented in the training data.
We present a novel approach to addressing this challenge in linear regression models.
arXiv Detail & Related papers (2023-09-18T08:07:58Z)
- A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties.
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
- Mixed moving average field guided learning for spatio-temporal data [0.0]
We define a novel Bayesian-temporal embedding and a theory-guided machine learning approach to make ensemble forecasts.
We use Lipschitz predictors to determine fixed-time and any-time PAC in the batch learning setting.
We then test the performance of our learning methodology by using linear predictors and data sets simulated from a dependence- Ornstein-Uhlenbeck process.
arXiv Detail & Related papers (2023-01-02T16:11:05Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)