An Efficient Data Analysis Method for Big Data using Multiple-Model
Linear Regression
- URL: http://arxiv.org/abs/2308.12691v1
- Date: Thu, 24 Aug 2023 10:20:15 GMT
- Title: An Efficient Data Analysis Method for Big Data using Multiple-Model
Linear Regression
- Authors: Bohan Lyu and Jianzhong Li
- Abstract summary: This paper introduces a new data analysis method for big data using a newly defined regression model named multiple-model linear regression (MMLR).
The proposed data analysis method is shown to be more efficient and flexible than other regression-based methods.
- Score: 4.085654010023149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new data analysis method for big data using
a newly defined regression model named multiple-model linear regression (MMLR),
which separates input datasets into subsets and constructs local linear
regression models for them. The proposed data analysis method is shown to be
more efficient and flexible than other regression-based methods. This paper
also proposes an approximate algorithm to construct MMLR models based on the
$(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness
and efficiency of the MMLR algorithm, whose time complexity is linear with
respect to the size of the input datasets. The paper also empirically evaluates
the method on both synthetic and real-world datasets; the algorithm shows
performance comparable to existing regression methods in many cases, while
taking almost the shortest time to reach high prediction accuracy.
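The core idea of the abstract (partition the input, fit a local linear model per subset, route each query to its local model) can be sketched as follows. This is an illustrative outline only, using plain k-means and ordinary least squares rather than the paper's $(\epsilon,\delta)$-estimator-based algorithm; all function names here are hypothetical.

```python
import numpy as np

def fit_multi_model(X, y, n_models=4, seed=0):
    """Partition the rows of X with a plain k-means pass, then fit one
    ordinary least-squares model per partition (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_models, replace=False)]
    for _ in range(20):  # Lloyd's iterations
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_models):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    models = []
    for k in range(n_models):
        A = np.c_[X[labels == k], np.ones(np.sum(labels == k))]  # intercept column
        coef, *_ = np.linalg.lstsq(A, y[labels == k], rcond=None)
        models.append(coef)
    return centers, models

def predict(X, centers, models):
    """Route each point to the local model of its nearest center."""
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    A = np.c_[X, np.ones(len(X))]
    return np.array([A[i] @ models[labels[i]] for i in range(len(X))])
```

On data generated from two different linear regimes, such a multiple-model fit recovers each regime exactly, while a single global linear regression would not.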
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMUs)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Multivariate regression modeling in integrative analysis via sparse
regularization [0.0]
Integrative analysis is an effective method to pool useful information from multiple independent datasets.
The integration is achieved by sparse estimation that performs variable and group selection.
The performance of the proposed method is demonstrated through Monte Carlo simulation and analyzing wastewater treatment data with microbe measurements.
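The sparse estimation underlying such variable selection can be illustrated with a generic lasso, here solved by proximal gradient descent (ISTA). This is a textbook sketch, not the paper's integrative estimator; the data, penalty level, and function name are invented for illustration.

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Proximal-gradient (ISTA) solver for the lasso:
       min_b (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()  # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[[0, 3]] = [2.0, -1.5]          # only two truly active variables
y = X @ beta + 0.1 * rng.normal(size=200)
b_hat = lasso_ista(X, y, alpha=0.1)
```

The soft-thresholding step sets small coefficients exactly to zero, which is what performs the variable selection.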
arXiv Detail & Related papers (2023-04-15T02:27:51Z)
- An adaptive shortest-solution guided decimation approach to sparse
high-dimensional linear regression [2.3759847811293766]
The proposed algorithm is adapted from the shortest-solution guided decimation approach and is referred to as adaptive shortest-solution guided decimation (ASSD).
ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.
arXiv Detail & Related papers (2022-11-28T04:29:57Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions are imposed on the parameters through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Easy Differentially Private Linear Regression [16.325734286930764]
We study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models.
We find that this algorithm obtains strong empirical performance in the data-rich setting.
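The exponential mechanism used for that selection step can be sketched generically: candidates are sampled with probability proportional to the exponential of their (scaled) utility. The utility values and names below are hypothetical; the paper's actual utility is Tukey depth over non-private regression models.

```python
import numpy as np

def exponential_mechanism(utilities, epsilon, sensitivity, rng):
    """Sample an index with probability proportional to
    exp(epsilon * u / (2 * sensitivity)) -- the exponential mechanism."""
    u = np.asarray(utilities, dtype=float)
    logits = epsilon * u / (2.0 * sensitivity)
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(u), p=probs)
```

With a generous privacy budget the mechanism almost always returns the highest-utility candidate; smaller budgets flatten the distribution toward uniform.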
arXiv Detail & Related papers (2022-08-15T17:42:27Z)
- Test Set Sizing Via Random Matrix Theory [91.3755431537592]
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression.
It defines "ideal" as satisfying the integrity metric, i.e., that the empirical model error equals the actual measurement noise.
This paper is the first to solve for the training and test size for any model in a way that is truly optimal.
arXiv Detail & Related papers (2021-12-11T13:18:33Z)
- Evaluation of Tree Based Regression over Multiple Linear Regression for
Non-normally Distributed Data in Battery Performance [0.5735035463793008]
This study explores the impact of data normality in building machine learning models.
Tree-based regression models and multiple linear regressions models are each built from a highly skewed non-normal dataset.
arXiv Detail & Related papers (2021-11-03T20:28:24Z)
- Piecewise linear regression and classification [0.20305676256390928]
This paper proposes a method for solving multivariate regression and classification problems using piecewise linear predictors.
A Python implementation of the algorithm described in this paper is available at http://cse.lab.imtlucca.it/bemporad/parc.
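The general pattern of piecewise linear regression can be sketched in one dimension as a breakpoint search with a separate OLS line on each side. This is a minimal illustration of the idea, not the PARC algorithm from the paper; the function name is hypothetical.

```python
import numpy as np

def fit_two_piece(x, y, min_pts=3):
    """Scan candidate breakpoints in sorted 1-D data and fit an OLS line
    on each side, keeping the split with the smallest total squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(min_pts, len(x) - min_pts):
        sse, coefs = 0.0, []
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            A = np.c_[xs, np.ones(len(xs))]        # [slope, intercept] design
            c, *_ = np.linalg.lstsq(A, ys, rcond=None)
            sse += float(((A @ c - ys) ** 2).sum())
            coefs.append(c)
        if best is None or sse < best[0]:
            best = (sse, x[i], coefs)
    return best  # (sse, breakpoint, [left_coefs, right_coefs])
```

The exhaustive scan is quadratic in the sample size; practical methods such as PARC instead alternate between assigning points to segments and refitting the local predictors.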
arXiv Detail & Related papers (2021-03-10T17:07:57Z)
- A Hypergradient Approach to Robust Regression without Correspondence [85.49775273716503]
We consider a variant of regression problem, where the correspondence between input and output data is not available.
Most existing methods are only applicable when the sample size is small.
We propose a new computational framework -- ROBOT -- for the shuffled regression problem.
arXiv Detail & Related papers (2020-11-30T21:47:38Z)
- Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.