Distributed Principal Subspace Analysis for Partitioned Big Data:
Algorithms, Analysis, and Implementation
- URL: http://arxiv.org/abs/2103.06406v1
- Date: Thu, 11 Mar 2021 01:33:38 GMT
- Title: Distributed Principal Subspace Analysis for Partitioned Big Data:
Algorithms, Analysis, and Implementation
- Authors: Bingqing Xiang, Arpita Gang, and Waheed U. Bajwa
- Abstract summary: Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning.
But centralized PSA solutions are fast becoming irrelevant in the modern era of big data.
This paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines.
- Score: 9.730443503568804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Principal Subspace Analysis (PSA) is one of the most popular approaches for
dimensionality reduction in signal processing and machine learning. But
centralized PSA solutions are fast becoming irrelevant in the modern era of big
data, in which the number of samples and/or the dimensionality of samples often
exceed the storage and/or computational capabilities of individual machines.
This has led to the study of distributed PSA solutions, in which the data are
partitioned across multiple machines and an estimate of the principal subspace
is obtained through collaboration among the machines. It is in this vein that
this paper revisits the problem of distributed PSA under the general framework
of an arbitrarily connected network of machines that lacks a central server.
The main contributions of the paper in this regard are threefold. First, two
algorithms are proposed in the paper that can be used for distributed PSA in
the case of data that are partitioned across either samples or (raw) features.
Second, in the case of sample-wise partitioned data, the proposed algorithm and
a variant of it are analyzed, and their convergence to the true subspace at
linear rates is established. Third, extensive experiments on both synthetic and
real-world data are carried out to validate the usefulness of the proposed
algorithms. In particular, in the case of sample-wise partitioned data, an
MPI-based distributed implementation is carried out to study the interplay
between network topology and communications cost as well as to study the effect
of straggler machines on the proposed algorithms.
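To make the setup concrete, below is a minimal, assumption-laden sketch of consensus-based subspace estimation for sample-wise partitioned data. It is an illustrative stand-in, not the paper's algorithm: the ring topology, mixing weights, gossip schedule, and QR re-orthonormalization are all choices made here for illustration, and the single-process simulation stands in for what the paper implements over MPI.

```python
# Illustrative sketch only: the abstract does not give the update equations,
# so this simulates a generic consensus-based subspace method for
# sample-wise partitioned data. Topology and mixing weights are assumptions.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring of n machines."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 0.5
        A[i, (i - 1) % n] = 0.25
        A[i, (i + 1) % n] = 0.25
    return A

def distributed_subspace_iteration(local_blocks, k, n_iters=200):
    """Each machine holds a block of samples (rows of the global matrix)
    and estimates the k-dimensional principal subspace collaboratively."""
    n = len(local_blocks)
    d = local_blocks[0].shape[1]
    A = ring_mixing_matrix(n)
    covs = [X.T @ X / X.shape[0] for X in local_blocks]   # local covariances
    rng = np.random.default_rng(0)
    Q = [np.linalg.qr(rng.standard_normal((d, k)))[0]] * n  # shared init
    for _ in range(n_iters):
        V = [covs[i] @ Q[i] for i in range(n)]            # local power step
        V = [sum(A[i, j] * V[j] for j in range(n))        # one gossip round
             for i in range(n)]
        Q = [np.linalg.qr(Vi)[0] for Vi in V]             # re-orthonormalize
    return Q  # per-machine estimates, near-identical at convergence

# Tiny demo: 4 machines, 10-dimensional samples, 3-dimensional subspace.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.standard_normal((10, 3)))[0]
blocks = [rng.standard_normal((50, 3)) @ basis.T
          + 0.05 * rng.standard_normal((50, 10)) for _ in range(4)]
estimates = distributed_subspace_iteration(blocks, k=3)
print(np.linalg.norm(estimates[0] @ estimates[0].T - basis @ basis.T))
```

At convergence the per-machine estimates agree (up to rotation) on the dominant k-dimensional subspace of the pooled covariance; in an actual deployment each list entry above would live on its own MPI rank, with the gossip step replaced by neighbor-to-neighbor messages.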
Related papers
- Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers [49.1574468325115]
We introduce a unified convergence analysis framework for deterministic samplers.
Our framework achieves iteration complexity of $\tilde{O}(d^2/\epsilon)$.
We also provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers.
arXiv Detail & Related papers (2024-10-18T07:37:36Z)
- Deep Generative Sampling in the Dual Divergence Space: A Data-efficient & Interpretative Approach for Generative AI [29.13807697733638]
We build on the remarkable achievements in generative sampling of natural images.
We propose an innovative challenge, potentially overly ambitious, which involves generating samples that resemble images.
The statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects.
arXiv Detail & Related papers (2024-04-10T22:35:06Z)
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
- Distributed Linear Regression with Compositional Covariates [5.085889377571319]
We focus on the distributed sparse penalized linear log-contrast model in massive compositional data.
Two distributed optimization techniques are proposed for solving the two different constrained convex optimization problems.
In the decentralized topology, we introduce a distributed coordinate-wise descent algorithm for obtaining a communication-efficient regularized estimation.
arXiv Detail & Related papers (2023-10-21T11:09:37Z)
- Factorization of Multi-Agent Sampling-Based Motion Planning [72.42734061131569]
Modern robotics often involves multiple embodied agents operating within a shared environment.
Standard sampling-based algorithms can be used to search for solutions in the robots' joint space.
We integrate the concept of factorization into sampling-based algorithms, which requires only minimal modifications to existing methods.
We present a general implementation of a factorized SBA, derive an analytical gain in terms of sample complexity for PRM*, and showcase empirical results for RRG.
arXiv Detail & Related papers (2023-04-01T15:50:18Z)
- Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z)
- Revisiting data augmentation for subspace clustering [21.737226432466496]
Subspace clustering is the classical problem of clustering a collection of data samples around several low-dimensional subspaces.
We argue that data distribution within each subspace plays a critical role in the success of self-expressive models.
We propose two subspace clustering frameworks for both unsupervised and semi-supervised settings.
arXiv Detail & Related papers (2022-07-20T08:13:08Z)
- Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees [115.08148491584997]
We present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2.
The new algorithms support bidirectional compression and can also be modified for the setting with batches and for federated learning with partial client participation.
arXiv Detail & Related papers (2021-10-07T10:04:32Z)
- A Linearly Convergent Algorithm for Distributed Principal Component Analysis [12.91948651812873]
This paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA).
The proposed algorithm is shown to converge linearly to a neighborhood of the true solution.
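Since DSA is the closest precursor to the present paper, a rough sketch may help. The update below is an assumed Sanger-rule-plus-consensus form with a hypothetical signature and mixing weights, not DSA's exact recursion:

```python
# Hypothetical sketch of one DSA-like iteration at a single machine; the
# signature, mixing weights, and step-size handling are all assumptions.
import numpy as np

def sanger_consensus_step(W, neighbor_Ws, weights, C_local, step=0.05):
    """W: current d-by-k estimate at this machine; neighbor_Ws: estimates
    received from neighbours (own estimate included); weights: mixing
    weights summing to one; C_local: local sample covariance."""
    mixed = sum(a * Wj for a, Wj in zip(weights, neighbor_Ws))  # consensus
    G = C_local @ W
    sanger = G - W @ np.triu(W.T @ G)  # Sanger's rule (generalized Hebbian)
    return mixed + step * sanger
```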
arXiv Detail & Related papers (2021-01-05T00:51:14Z)
- Reliable Distributed Clustering with Redundant Data Assignment [48.40574754136434]
We present distributed generalized clustering algorithms that can handle large-scale data across multiple machines despite straggling or unreliable machines.
We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations.
arXiv Detail & Related papers (2020-02-20T17:44:37Z)
- Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering [13.491022200305824]
We propose a distributed matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale well to big data and achieve superior or competitive performance compared to other distributed methods.
arXiv Detail & Related papers (2020-02-10T13:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.