Scalable High-Dimensional Multivariate Linear Regression for
Feature-Distributed Data
- URL: http://arxiv.org/abs/2307.03410v2
- Date: Mon, 11 Mar 2024 03:56:13 GMT
- Title: Scalable High-Dimensional Multivariate Linear Regression for
Feature-Distributed Data
- Authors: Shuo-Chieh Huang, Ruey S. Tsay
- Abstract summary: This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to feature-distributed data.
The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets.
We apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature-distributed data, i.e., data partitioned by features and stored
across multiple computing nodes, are increasingly common in applications with a
large number of features. This paper proposes a two-stage relaxed greedy
algorithm (TSRGA) for applying multivariate linear regression to such data. The
main advantage of TSRGA is that its communication complexity does not depend on
the feature dimension, making it highly scalable to very large data sets. In
addition, for multivariate response variables, TSRGA can be used to yield
low-rank coefficient estimates. The fast convergence of TSRGA is validated by
simulation experiments. Finally, we apply the proposed TSRGA in a financial
application that leverages unstructured data from the 10-K reports,
demonstrating its usefulness in applications with many dense large-dimensional
matrices.
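As a rough, hypothetical illustration of the feature-distributed setting described in the abstract (the variable names, the correlation-screening rule, and the relaxation parameter below are assumptions for exposition, not the authors' TSRGA implementation), the sketch shows a relaxed-greedy regression loop in which each node keeps its own block of features and only scalar scores and n-dimensional residual vectors leave a node, so the communicated payload does not grow with the feature dimension:

import numpy as np

rng = np.random.default_rng(0)
n, blocks = 200, [30, 40, 50]                    # sample size; features held by each of 3 nodes
X_nodes = [rng.standard_normal((n, p)) for p in blocks]
beta_true = [np.zeros(p) for p in blocks]
beta_true[1][:3] = [2.0, -1.5, 1.0]              # a few active features live on node 1
y = sum(Xk @ bk for Xk, bk in zip(X_nodes, beta_true)) + 0.1 * rng.standard_normal(n)

coefs = [np.zeros(p) for p in blocks]
resid = y.copy()
nu = 0.5                                         # relaxation (shrinkage) step, an assumption
for _ in range(50):
    # Each node screens its own columns; only scalar scores and one n-vector are communicated.
    scores = [np.abs(Xk.T @ resid) / np.linalg.norm(Xk, axis=0) for Xk in X_nodes]
    k = int(np.argmax([s.max() for s in scores]))        # winning node
    j = int(scores[k].argmax())                          # winning feature on that node
    xj = X_nodes[k][:, j]
    step = (xj @ resid) / (xj @ xj)                      # 1-D least-squares update
    coefs[k][j] += nu * step
    resid -= nu * step * xj                              # broadcast updated residual (length n)
print("final residual norm:", np.linalg.norm(resid))

The actual TSRGA adds a second stage and, for multivariate responses, yields low-rank coefficient estimates; see the paper for the precise updates and guarantees.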
Related papers
- Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis [24.04852523970509]
We propose a novel BTR model that incorporates a nonparametric Multiplicative Gamma Process (MGP) prior.
To handle discrete data, we introduce the Pólya-Gamma augmentation for closed-form updates.
We develop an efficient Gibbs sampler for consistent posterior simulation, which reduces the computational complexity of the previous VI algorithm by two orders of magnitude.
arXiv Detail & Related papers (2024-12-04T13:55:14Z)
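For readers unfamiliar with the tensor ring (TR) factorization referenced in the item above: each entry of the tensor is the trace of a product of per-mode core slices. The toy snippet below only illustrates that reconstruction rule with random cores; it has nothing to do with the paper's MGP prior, Pólya-Gamma augmentation, or Gibbs sampler.

import numpy as np

rng = np.random.default_rng(1)
dims, ranks = (4, 5, 6), (2, 3, 2)               # tensor sizes and TR ranks (the ring closes: r3 = r0)
G = [rng.standard_normal((ranks[m], dims[m], ranks[(m + 1) % 3])) for m in range(3)]

T = np.zeros(dims)
for i in range(dims[0]):
    for j in range(dims[1]):
        for k in range(dims[2]):
            # entry (i, j, k) is the trace of the product of the three core slices
            T[i, j, k] = np.trace(G[0][:, i, :] @ G[1][:, j, :] @ G[2][:, k, :])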
- Efficient Nonparametric Tensor Decomposition for Binary and Count Data [27.02813234958821]
We propose ENTED, an Efficient Nonparametric Tensor Decomposition for binary and count tensors.
arXiv Detail & Related papers (2024-01-15T14:27:03Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- Heterogeneous Multi-Task Gaussian Cox Processes [61.67344039414193]
We present a novel extension of multi-task Gaussian Cox processes for modeling heterogeneous correlated tasks jointly.
A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks can facilitate sharing of information between heterogeneous tasks.
We derive a mean-field approximation to realize closed-form iterative updates for estimating model parameters.
arXiv Detail & Related papers (2023-08-29T15:01:01Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data is stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
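For context on the StreaMRAK item above: standard kernel ridge regression solves (K + lambda*I) alpha = y with the full n-by-n kernel matrix K held in memory, which is exactly the bottleneck a streaming variant avoids. A minimal batch baseline in generic notation (not the StreaMRAK algorithm):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # pairwise squared distances, then the Gaussian kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)

lam = 1e-3
K = rbf_kernel(X, X)                                   # n x n matrix: the memory bottleneck
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients
X_new = np.linspace(-3, 3, 5)[:, None]
y_new = rbf_kernel(X_new, X) @ alpha                   # predictions at new points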
- Scalable Gaussian Processes for Data-Driven Design using Big Data with
Categorical Factors [14.337297795182181]
Gaussian processes (GPs) have difficulties accommodating big datasets, categorical inputs, and multiple responses.
We propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously.
Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism.
arXiv Detail & Related papers (2021-06-26T02:17:23Z)
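The latent-variable idea mentioned in the previous item amounts to mapping each categorical level to a point in a small continuous latent space so that an ordinary kernel can operate on mixed inputs. The toy kernel below is only meant to convey that mapping; in the paper the latent coordinates are learned by variational inference, whereas here they are random placeholders:

import numpy as np

rng = np.random.default_rng(3)
n_levels, latent_dim = 4, 2
Z = rng.standard_normal((n_levels, latent_dim))   # placeholder latent coordinates per level

def mixed_kernel(x1, cat1, x2, cat2, gamma=0.5):
    # concatenate quantitative inputs with the latent embedding of the categorical level,
    # then apply an ordinary RBF kernel to the combined vector
    u = np.concatenate([np.atleast_1d(x1), Z[cat1]])
    v = np.concatenate([np.atleast_1d(x2), Z[cat2]])
    return np.exp(-gamma * np.sum((u - v) ** 2))

print(mixed_kernel(0.3, 1, 0.5, 2))               # similarity across two different levels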
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data
Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
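A Canonical/Polyadic (CP) parameterization, as imposed by Rank-R FNN above, writes a weight tensor as a sum of R rank-1 terms, so a layer acting on a matrix-valued input stores only small factor matrices instead of a vectorized weight. A minimal sketch with assumed notation (not the paper's code):

import numpy as np

rng = np.random.default_rng(4)
d1, d2, R = 8, 6, 3
A = rng.standard_normal((d1, R))                  # mode-1 factors a_r
B = rng.standard_normal((d2, R))                  # mode-2 factors b_r
X = rng.standard_normal((d1, d2))                 # one matrix-valued input, never vectorized

# score = sum_r a_r^T X b_r : only A and B are stored, not a full d1*d2 weight vector
score = sum(A[:, r] @ X @ B[:, r] for r in range(R))
hidden = np.tanh(score)                           # a Rank-R-style nonlinear activation

W = sum(np.outer(A[:, r], B[:, r]) for r in range(R))   # equivalent rank-R weight matrix
assert np.isclose(score, np.sum(W * X))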
- Deep Cellular Recurrent Network for Efficient Analysis of Time-Series
Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while utilizing substantially less trainable parameters when compared to comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z)
- Random Sampling High Dimensional Model Representation Gaussian Process
Regression (RS-HDMR-GPR) for representing multidimensional functions with
machine-learned lower-dimensional terms allowing insight with a general
method [0.0]
A Python implementation of RS-HDMR-GPR (Random Sampling High Dimensional Model Representation Gaussian Process Regression) is presented.
Code allows for imputation of missing values of the variables and for a significant pruning of the useful number of HDMR terms.
The capabilities of this regression tool are demonstrated on test cases involving synthetic analytic functions, the potential energy surface of the water molecule, kinetic energy densities of materials, and financial market data.
arXiv Detail & Related papers (2020-11-24T00:12:05Z)
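For reference on the previous item, the HDMR idea is to approximate a high-dimensional function by a hierarchy of low-dimensional component functions, each of which RS-HDMR-GPR fits with Gaussian process regression; a generic form of the truncated expansion is

f(x_1, \ldots, x_D) \approx f_0 + \sum_{i=1}^{D} f_i(x_i) + \sum_{1 \le i < j \le D} f_{ij}(x_i, x_j) + \cdots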
- Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
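As background for the GLLVM item above, a generic generalized linear latent variable model ties the mean of response j for sample i to a low-dimensional latent vector through a link function g (the notation here is illustrative, not copied from the paper):

g\bigl(\mathbb{E}[y_{ij}]\bigr) = \beta_{0j} + \mathbf{u}_i^{\top} \boldsymbol{\lambda}_j, \qquad \mathbf{u}_i \in \mathbb{R}^q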
- Fast cross-validation for multi-penalty ridge regression [0.0]
Ridge regression is a simple model for high-dimensional data.
Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat-matrix.
Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems.
arXiv Detail & Related papers (2020-05-19T09:13:43Z)
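For the ridge item above: with a single penalty the hat matrix is H(\lambda) = X (X^{\top} X + \lambda I)^{-1} X^{\top}; with separately penalized feature groups the identity is replaced by a block-diagonal penalty matrix. The unweighted form below is standard background only; the paper's contribution is an efficient formula for the sample-weighted, multi-penalty version needed in cross-validation.

H(\lambda_1, \ldots, \lambda_G) = X \bigl( X^{\top} X + \operatorname{blockdiag}(\lambda_1 I_{p_1}, \ldots, \lambda_G I_{p_G}) \bigr)^{-1} X^{\top}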
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of using it.