A step towards the integration of machine learning and small area
estimation
- URL: http://arxiv.org/abs/2402.07521v1
- Date: Mon, 12 Feb 2024 09:43:17 GMT
- Title: A step towards the integration of machine learning and small area
estimation
- Authors: Tomasz Żądło, Adam Chwila
- Abstract summary: We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristics.
We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well.
We also propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The use of machine-learning techniques has grown in numerous research areas.
These techniques are now also widely used in statistics, including official
statistics, both for data collection (e.g. satellite imagery, web scraping and text
mining, data cleaning, integration and imputation) and for data analysis.
However, the use of these methods in survey sampling, including small area
estimation, is still very limited. Therefore, we propose a predictor supported
by these algorithms which can be used to predict any population or
subpopulation characteristics based on cross-sectional and longitudinal data.
Machine learning methods have already been shown to be very powerful in
identifying and modelling complex and nonlinear relationships between the
variables, which means that they have very good properties in the case of strong
departures from the classic assumptions. Therefore, we analyse the performance
of our proposal under a different set-up, which is in our opinion of greater importance
in real-life surveys. We study only small departures from the assumed model, to
show that our proposal is a good alternative in this case as well, even in
comparison with optimal methods under the model. What is more, we propose a
method for estimating the accuracy of machine learning predictors, which makes
it possible to compare their accuracy with that of classic methods, where the accuracy
is measured as in survey sampling practice. Solving this problem is
indicated in the literature as one of the key issues in the integration of these
approaches. The simulation studies are based on a real, longitudinal dataset,
freely available from the Polish Local Data Bank, where the prediction problem
of subpopulation characteristics in the last period, with "borrowing strength"
from other subpopulations and time periods, is considered.
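The idea of predicting a subpopulation characteristic while "borrowing strength" from other subpopulations can be illustrated with a minimal sketch. It is not the authors' predictor: the toy data, the field names (`area`, `x`, `y`) and the helpers `fit_line` and `predict_area_total` are hypothetical, and a simple least-squares line stands in for whatever machine learning algorithm would be used in practice. The model is fitted on sampled units from all areas, and the total for one area is then taken as the observed values plus the model predictions for its non-sampled units.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in for any ML regressor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def predict_area_total(units, area, model):
    """Plug-in total: observed y for sampled units, model prediction otherwise."""
    total = 0.0
    for u in units:
        if u["area"] != area:
            continue
        total += u["y"] if u["y"] is not None else model(u["x"])
    return total


# Hypothetical toy population: y is observed only for sampled units (None otherwise).
units = [
    {"area": "A", "x": 1.0, "y": 2.0},
    {"area": "A", "x": 2.0, "y": None},
    {"area": "B", "x": 3.0, "y": 6.0},
    {"area": "B", "x": 4.0, "y": 8.0},
    {"area": "B", "x": 5.0, "y": None},
]

# The model is fitted on sampled units from ALL areas, so area "A",
# with a single sampled unit, borrows strength from area "B".
sampled = [u for u in units if u["y"] is not None]
model = fit_line([u["x"] for u in sampled], [u["y"] for u in sampled])

total_a = predict_area_total(units, "A", model)
total_b = predict_area_total(units, "B", model)
print(total_a, total_b)
```

In this sketch area "A" has only one sampled unit, so a model fitted on that area alone would not be identifiable; pooling the sample across areas is what makes the prediction possible.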
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- How to Determine the Most Powerful Pre-trained Language Model without
Brute Force Fine-tuning? An Empirical Survey [23.757740341834126]
We show that H-Score generally performs well, with advantages in both effectiveness and efficiency.
We also outline the difficulties of consideration of training details, applicability to text generation, and consistency to certain metrics which shed light on future directions.
arXiv Detail & Related papers (2023-12-08T01:17:28Z)
- Multi-dimensional domain generalization with low-rank structures [18.565189720128856]
In statistical and machine learning methods, it is typically assumed that the test data are identically distributed with the training data.
This assumption does not always hold, especially in applications where the target population is not well represented in the training data.
We present a novel approach to addressing this challenge in linear regression models.
arXiv Detail & Related papers (2023-09-18T08:07:58Z)
- A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties.
arXiv Detail & Related papers (2023-04-11T09:13:17Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Mixed moving average field guided learning for spatio-temporal data [0.0]
We define a novel Bayesian-temporal embedding and a theory-guided machine learning approach to make ensemble forecasts.
We use Lipschitz predictors to determine fixed-time and any-time PAC in the batch learning setting.
We then test the performance of our learning methodology by using linear predictors and data sets simulated from a dependence- Ornstein-Uhlenbeck process.
arXiv Detail & Related papers (2023-01-02T16:11:05Z)
- The Lifecycle of a Statistical Model: Model Failure Detection,
Identification, and Refitting [26.351782287953267]
We develop tools and theory for detecting and identifying regions of the covariate space (subpopulations) where model performance has begun to degrade.
We present empirical results with three real-world data sets.
We complement these empirical results with theory proving that our methodology is minimax optimal for recovering anomalous subpopulations.
arXiv Detail & Related papers (2022-02-08T22:02:31Z)
- Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep
Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel
Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)