An Efficient Data Analysis Method for Big Data using Multiple-Model
Linear Regression
- URL: http://arxiv.org/abs/2308.12691v1
- Date: Thu, 24 Aug 2023 10:20:15 GMT
- Title: An Efficient Data Analysis Method for Big Data using Multiple-Model
Linear Regression
- Authors: Bohan Lyu and Jianzhong Li
- Abstract summary: This paper introduces a new data analysis method for big data using a newly defined regression model named multiple-model linear regression (MMLR).
The proposed data analysis method is shown to be more efficient and flexible than other regression-based methods.
- Score: 4.085654010023149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new data analysis method for big data using
a newly defined regression model named multiple-model linear regression (MMLR),
which separates input datasets into subsets and constructs local linear
regression models for them. The proposed data analysis method is shown to be
more efficient and flexible than other regression-based methods. This paper
also proposes an approximate algorithm to construct MMLR models based on the
$(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness
and efficiency of the MMLR algorithm, whose time complexity is linear with
respect to the size of the input datasets. The paper also empirically evaluates
the method on both synthetic and real-world datasets; the algorithm shows
performance comparable to existing regression methods in many cases, while
taking almost the shortest time to reach high prediction accuracy.
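The core idea of the abstract (partition the input, fit a local linear model per subset, route each query to its local model) can be sketched as follows. This is an illustrative outline only, using plain k-means and ordinary least squares rather than the paper's $(\epsilon,\delta)$-estimator-based algorithm; all function names here are hypothetical.

```python
import numpy as np

def fit_multi_model(X, y, n_models=4, seed=0):
    """Partition the rows of X with a plain k-means pass, then fit one
    ordinary least-squares model per partition (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_models, replace=False)]
    for _ in range(20):  # Lloyd's iterations
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_models):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    models = []
    for k in range(n_models):
        A = np.c_[X[labels == k], np.ones(np.sum(labels == k))]  # intercept column
        coef, *_ = np.linalg.lstsq(A, y[labels == k], rcond=None)
        models.append(coef)
    return centers, models

def predict(X, centers, models):
    """Route each point to the local model of its nearest center."""
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    A = np.c_[X, np.ones(len(X))]
    return np.array([A[i] @ models[labels[i]] for i in range(len(X))])
```

On data generated from two different linear regimes, such a multiple-model fit recovers each regime exactly, while a single global linear regression would not.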
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMUs)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Multivariate regression modeling in integrative analysis via sparse
regularization [0.0]
Integrative analysis is an effective method to pool useful information from multiple independent datasets.
The integration is achieved by sparse estimation that performs variable and group selection.
The performance of the proposed method is demonstrated through Monte Carlo simulation and analyzing wastewater treatment data with microbe measurements.
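The sparse estimation underlying such variable selection can be illustrated with a generic lasso, here solved by proximal gradient descent (ISTA). This is a textbook sketch, not the paper's integrative estimator; the data, penalty level, and function name are invented for illustration.

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Proximal-gradient (ISTA) solver for the lasso:
       min_b (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()  # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[[0, 3]] = [2.0, -1.5]          # only two truly active variables
y = X @ beta + 0.1 * rng.normal(size=200)
b_hat = lasso_ista(X, y, alpha=0.1)
```

The soft-thresholding step sets small coefficients exactly to zero, which is what performs the variable selection.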
arXiv Detail & Related papers (2023-04-15T02:27:51Z)
- An adaptive shortest-solution guided decimation approach to sparse
high-dimensional linear regression [2.3759847811293766]
The proposed algorithm is adapted from the shortest-solution guided decimation approach and is referred to as adaptive shortest-solution guided decimation (ASSD).
ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.
arXiv Detail & Related papers (2022-11-28T04:29:57Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions are imposed on the parameters through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Easy Differentially Private Linear Regression [16.325734286930764]
We study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models.
We find that this algorithm obtains strong empirical performance in the data-rich setting.
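The exponential mechanism used for that selection step can be sketched generically: candidates are sampled with probability proportional to the exponential of their (scaled) utility. The utility values and names below are hypothetical; the paper's actual utility is Tukey depth over non-private regression models.

```python
import numpy as np

def exponential_mechanism(utilities, epsilon, sensitivity, rng):
    """Sample an index with probability proportional to
    exp(epsilon * u / (2 * sensitivity)) -- the exponential mechanism."""
    u = np.asarray(utilities, dtype=float)
    logits = epsilon * u / (2.0 * sensitivity)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(u), p=probs)
```

With a generous privacy budget the mechanism almost always returns the highest-utility candidate; smaller budgets flatten the distribution toward uniform.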
arXiv Detail & Related papers (2022-08-15T17:42:27Z)
- Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., that the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z)
- Evaluation of Tree Based Regression over Multiple Linear Regression for
Non-normally Distributed Data in Battery Performance [0.5735035463793008]
This study explores the impact of data normality in building machine learning models.
Tree-based regression models and multiple linear regressions models are each built from a highly skewed non-normal dataset.
arXiv Detail & Related papers (2021-11-03T20:28:24Z)
- Piecewise linear regression and classification [0.20305676256390928]
This paper proposes a method for solving multivariate regression and classification problems using piecewise linear predictors.
A Python implementation of the algorithm described in this paper is available at http://cse.lab.imtlucca.it/bemporad/parc.
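The general pattern of piecewise linear regression can be sketched in one dimension as a breakpoint search with a separate OLS line on each side. This is a minimal illustration of the idea, not the PARC algorithm from the paper; the function name is hypothetical.

```python
import numpy as np

def fit_two_piece(x, y, min_pts=3):
    """Scan candidate breakpoints in sorted 1-D data and fit an OLS line
    on each side, keeping the split with the smallest total squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(min_pts, len(x) - min_pts):
        sse, coefs = 0.0, []
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            A = np.c_[xs, np.ones(len(xs))]        # [slope, intercept] design
            c, *_ = np.linalg.lstsq(A, ys, rcond=None)
            sse += float(((A @ c - ys) ** 2).sum())
            coefs.append(c)
        if best is None or sse < best[0]:
            best = (sse, x[i], coefs)
    return best  # (sse, breakpoint, [left_coefs, right_coefs])
```

The exhaustive scan is quadratic in the sample size; practical methods such as PARC instead alternate between assigning points to segments and refitting the local predictors.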
arXiv Detail & Related papers (2021-03-10T17:07:57Z)
- A Hypergradient Approach to Robust Regression without Correspondence [85.49775273716503]
We consider a variant of regression problem, where the correspondence between input and output data is not available.
Most existing methods are only applicable when the sample size is small.
We propose a new computational framework -- ROBOT -- for the shuffled regression problem.
arXiv Detail & Related papers (2020-11-30T21:47:38Z)
- Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.