Divide-and-conquer methods for big data analysis
- URL: http://arxiv.org/abs/2102.10771v1
- Date: Mon, 22 Feb 2021 04:40:55 GMT
- Title: Divide-and-conquer methods for big data analysis
- Authors: Xueying Chen, Jerry Q. Cheng, Min-ge Xie
- Abstract summary: Divide-and-conquer methodology refers to a multiple-step process.
This article reviews recent developments of divide-and-conquer methods in a variety of settings.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of big data analysis, the divide-and-conquer methodology
refers to a multiple-step process: first splitting a data set into several
smaller ones; then analyzing each set separately; finally combining results
from each analysis together. This approach is effective in handling large data
sets that cannot be analyzed in their entirety on a single computer due to
limits on memory or computation time. The combined results
yield statistical inference similar to what would be obtained from analyzing
the entire data set. This article reviews recent developments of
divide-and-conquer methods in a variety of settings, including combining based
on parametric, semiparametric and nonparametric models, online sequential
updating methods, among others. Theoretical development on the efficiency of
the divide-and-conquer methods is also discussed.
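The three-step process described in the abstract (split, analyze separately, combine) can be sketched for a simple case, ordinary least squares, where the combined estimate is just the average of the local ones. This is a minimal illustration with simulated data, not any specific method from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "big" data set: y = X @ beta + noise.
n, p, k = 10_000, 3, 10          # k = number of splits
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

# Step 1: split the data set into k smaller ones.
X_parts = np.array_split(X, k)
y_parts = np.array_split(y, k)

# Step 2: analyze each subset separately (here, OLS on each chunk).
local_estimates = [np.linalg.lstsq(Xi, yi, rcond=None)[0]
                   for Xi, yi in zip(X_parts, y_parts)]

# Step 3: combine the local results (simple average of the k estimates).
beta_dc = np.mean(local_estimates, axis=0)

# Full-data estimate, for comparison with the combined one.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

For OLS with equal-sized splits this averaged estimator is close to the full-data estimator; the papers reviewed in the article study when and how such combining preserves statistical efficiency in harder settings.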
Related papers
- Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System [48.093356587573666]
Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction. We propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls.
arXiv Detail & Related papers (2025-05-22T07:25:31Z) - Scaling Inter-procedural Dataflow Analysis on the Cloud [19.562864760293955]
We develop a distributed framework called BigDataflow running on a large-scale cluster.
BigDataflow can finish analyzing programs of millions of lines of code in minutes.
arXiv Detail & Related papers (2024-12-17T06:18:56Z) - A Closer Look at Deep Learning on Tabular Data [52.50778536274327]
Tabular data is prevalent across various domains in machine learning.
Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones.
arXiv Detail & Related papers (2024-07-01T04:24:07Z) - Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Bayesian Federated Inference for regression models based on non-shared multicenter data sets from heterogeneous populations [0.0]
In a regression model, the sample size must be large enough relative to the number of possible predictors.
Pooling data from different data sets collected in different (medical) centers would alleviate this problem, but is often not feasible due to privacy regulations or logistical problems.
An alternative route would be to analyze the local data in the centers separately and combine the statistical inference results with the Bayesian Federated Inference (BFI) methodology.
The aim of this approach is to compute from the inference results in separate centers what would have been found if the statistical analysis was performed on the combined data.
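The combining step described above can be sketched with inverse-variance (precision) weighting, a standard fixed-effect way of merging center-level estimates into an approximation of the pooled-data result. The numbers below are hypothetical, and this sketch is not the full BFI methodology:

```python
import numpy as np

# Hypothetical local results from three centers: each reports an
# estimate of the same regression coefficient and its variance.
estimates = np.array([0.80, 0.95, 0.88])
variances = np.array([0.04, 0.09, 0.02])

# Inverse-variance weighting: centers with more precise estimates
# contribute more to the combined estimate.
weights = 1.0 / variances
combined = np.sum(weights * estimates) / np.sum(weights)

# Variance of the combined estimate; smaller than any single
# center's variance, as a pooled analysis would be.
combined_var = 1.0 / np.sum(weights)
```

Note that only summary statistics (estimates and variances) cross center boundaries, which is what makes this kind of combining compatible with privacy constraints on the raw data.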
arXiv Detail & Related papers (2024-02-05T11:10:27Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Multivariate regression modeling in integrative analysis via sparse regularization [0.0]
Integrative analysis is an effective method to pool useful information from multiple independent datasets.
The integration is achieved by sparse estimation that performs variable and group selection.
The performance of the proposed method is demonstrated through Monte Carlo simulation and analyzing wastewater treatment data with microbe measurements.
arXiv Detail & Related papers (2023-04-15T02:27:51Z) - Leachable Component Clustering [10.377914682543903]
In this work, a novel approach to clustering of incomplete data, termed leachable component clustering, is proposed.
The proposed method handles data imputation via Bayes alignment and recovers the lost patterns in theory.
Experiments on several artificial incomplete data sets demonstrate that the proposed method achieves superior performance compared with other state-of-the-art algorithms.
arXiv Detail & Related papers (2022-08-28T13:13:17Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - Probabilistic methods for approximate archetypal analysis [8.829245587252435]
Archetypal analysis is an unsupervised learning method for exploratory data analysis.
We introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data.
We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
arXiv Detail & Related papers (2021-08-12T14:27:11Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Distributed Learning of Finite Gaussian Mixtures [21.652015112462]
We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures.
The new estimator is shown to be consistent and to retain root-n consistency under some general conditions.
Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has comparable statistical performance with the global estimator.
arXiv Detail & Related papers (2020-10-20T16:17:47Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - A General Method for Robust Learning from Batches [56.59844655107251]
We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains.
We derive the first robust computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and gaussian-mixture distribution estimation.
arXiv Detail & Related papers (2020-02-25T18:53:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.