Distributed Principal Subspace Analysis for Partitioned Big Data:
Algorithms, Analysis, and Implementation
- URL: http://arxiv.org/abs/2103.06406v1
- Date: Thu, 11 Mar 2021 01:33:38 GMT
- Title: Distributed Principal Subspace Analysis for Partitioned Big Data:
Algorithms, Analysis, and Implementation
- Authors: Bingqing Xiang, Arpita Gang, and Waheed U. Bajwa
- Abstract summary: Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning.
But centralized PSA solutions are fast becoming irrelevant in the modern era of big data.
This paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines.
- Score: 9.730443503568804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Principal Subspace Analysis (PSA) is one of the most popular approaches for
dimensionality reduction in signal processing and machine learning. But
centralized PSA solutions are fast becoming irrelevant in the modern era of big
data, in which the number of samples and/or the dimensionality of samples often
exceed the storage and/or computational capabilities of individual machines.
This has led to the study of distributed PSA solutions, in which the data are
partitioned across multiple machines and an estimate of the principal subspace
is obtained through collaboration among the machines. It is in this vein that
this paper revisits the problem of distributed PSA under the general framework
of an arbitrarily connected network of machines that lacks a central server.
The main contributions of the paper in this regard are threefold. First, two
algorithms are proposed in the paper that can be used for distributed PSA in
the case of data that are partitioned across either samples or (raw) features.
Second, in the case of sample-wise partitioned data, the proposed algorithm and
a variant of it are analyzed, and their convergence to the true subspace at
linear rates is established. Third, extensive experiments on both synthetic and
real-world data are carried out to validate the usefulness of the proposed
algorithms. In particular, in the case of sample-wise partitioned data, an
MPI-based distributed implementation is carried out to study the interplay
between network topology and communications cost as well as to study the effect
of straggler machines on the proposed algorithms.
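To make the setup concrete, below is a minimal, assumption-laden sketch of consensus-based subspace estimation for sample-wise partitioned data. It is an illustrative stand-in, not the paper's algorithm: the ring topology, mixing weights, gossip schedule, and QR re-orthonormalization are all choices made here for illustration, and the single-process simulation stands in for what the paper implements over MPI.

```python
# Illustrative sketch only: the abstract does not give the update equations,
# so this simulates a generic consensus-based subspace method for
# sample-wise partitioned data. Topology and mixing weights are assumptions.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring of n machines."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 0.5
        A[i, (i - 1) % n] = 0.25
        A[i, (i + 1) % n] = 0.25
    return A

def distributed_subspace_iteration(local_blocks, k, n_iters=200):
    """Each machine holds a block of samples (rows of the global matrix)
    and estimates the k-dimensional principal subspace collaboratively."""
    n = len(local_blocks)
    d = local_blocks[0].shape[1]
    A = ring_mixing_matrix(n)
    covs = [X.T @ X / X.shape[0] for X in local_blocks]   # local covariances
    rng = np.random.default_rng(0)
    Q = [np.linalg.qr(rng.standard_normal((d, k)))[0]] * n  # shared init
    for _ in range(n_iters):
        V = [covs[i] @ Q[i] for i in range(n)]            # local power step
        V = [sum(A[i, j] * V[j] for j in range(n))        # one gossip round
             for i in range(n)]
        Q = [np.linalg.qr(Vi)[0] for Vi in V]             # re-orthonormalize
    return Q  # per-machine estimates, near-identical at convergence

# Tiny demo: 4 machines, 10-dimensional samples, 3-dimensional subspace.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.standard_normal((10, 3)))[0]
blocks = [rng.standard_normal((50, 3)) @ basis.T
          + 0.05 * rng.standard_normal((50, 10)) for _ in range(4)]
estimates = distributed_subspace_iteration(blocks, k=3)
print(np.linalg.norm(estimates[0] @ estimates[0].T - basis @ basis.T))
```

At convergence the per-machine estimates agree (up to rotation) on the dominant k-dimensional subspace of the pooled covariance; in an actual deployment each list entry above would live on its own MPI rank, with the gossip step replaced by neighbor-to-neighbor messages.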
Related papers
- Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers [49.1574468325115]
We introduce a unified convergence analysis framework for deterministic samplers.
Our framework achieves iteration complexity of $\tilde{O}(d^2/\epsilon)$.
We also provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers.
arXiv Detail & Related papers (2024-10-18T07:37:36Z)
- Deep Generative Sampling in the Dual Divergence Space: A Data-efficient & Interpretative Approach for Generative AI [29.13807697733638]
We build on the remarkable achievements in generative sampling of natural images.
We propose an innovative challenge, potentially overly ambitious, which involves generating samples that resemble images.
The statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects.
arXiv Detail & Related papers (2024-04-10T22:35:06Z)
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
- Distributed Linear Regression with Compositional Covariates [5.085889377571319]
We focus on the distributed sparse penalized linear log-contrast model in massive compositional data.
Two distributed optimization techniques are proposed for solving the two different constrained convex optimization problems.
In the decentralized topology, we introduce a distributed coordinate-wise descent algorithm for obtaining a communication-efficient regularized estimation.
arXiv Detail & Related papers (2023-10-21T11:09:37Z)
- Factorization of Multi-Agent Sampling-Based Motion Planning [72.42734061131569]
Modern robotics often involves multiple embodied agents operating within a shared environment.
Standard sampling-based algorithms can be used to search for solutions in the robots' joint space.
We integrate the concept of factorization into sampling-based algorithms, which requires only minimal modifications to existing methods.
We present a general implementation of a factorized SBA, derive an analytical gain in terms of sample complexity for PRM*, and showcase empirical results for RRG.
arXiv Detail & Related papers (2023-04-01T15:50:18Z)
- Synthetic-to-Real Domain Generalized Semantic Segmentation for 3D Indoor Point Clouds [69.64240235315864]
This paper introduces the synthetic-to-real domain generalization setting to this task.
The domain gap between synthetic and real-world point cloud data mainly lies in the different layouts and point patterns.
Experiments on the synthetic-to-real benchmark demonstrate that both CINMix and multi-prototypes can narrow the distribution gap.
arXiv Detail & Related papers (2022-12-09T05:07:43Z)
- Revisiting data augmentation for subspace clustering [21.737226432466496]
Subspace clustering is the classical problem of clustering a collection of data samples around several low-dimensional subspaces.
We argue that data distribution within each subspace plays a critical role in the success of self-expressive models.
We propose two subspace clustering frameworks for both unsupervised and semi-supervised settings.
arXiv Detail & Related papers (2022-07-20T08:13:08Z)
- Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees [115.08148491584997]
We present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2.
The new algorithms support bidirectional compression and can also be modified for the setting with batches and for federated learning with partial client participation.
arXiv Detail & Related papers (2021-10-07T10:04:32Z)
- A Linearly Convergent Algorithm for Distributed Principal Component Analysis [12.91948651812873]
This paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA).
The proposed algorithm is shown to converge linearly to a neighborhood of the true solution.
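Since DSA is the closest precursor to the present paper, a rough sketch may help. The update below is an assumed Sanger-rule-plus-consensus form with a hypothetical signature and mixing weights, not DSA's exact recursion:

```python
# Hypothetical sketch of one DSA-like iteration at a single machine; the
# signature, mixing weights, and step-size handling are all assumptions.
import numpy as np

def sanger_consensus_step(W, neighbor_Ws, weights, C_local, step=0.05):
    """W: current d-by-k estimate at this machine; neighbor_Ws: estimates
    received from neighbours (own estimate included); weights: mixing
    weights summing to one; C_local: local sample covariance."""
    mixed = sum(a * Wj for a, Wj in zip(weights, neighbor_Ws))  # consensus
    G = C_local @ W
    sanger = G - W @ np.triu(W.T @ G)  # Sanger's rule (generalized Hebbian)
    return mixed + step * sanger
```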
arXiv Detail & Related papers (2021-01-05T00:51:14Z)
- Reliable Distributed Clustering with Redundant Data Assignment [48.40574754136434]
We present distributed generalized clustering algorithms that can handle large-scale data across multiple machines despite straggling or unreliable machines.
We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations.
arXiv Detail & Related papers (2020-02-20T17:44:37Z)
- Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering [13.491022200305824]
We propose a distributed matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale well to big data and achieve superior or competitive performance compared to other distributed methods.
arXiv Detail & Related papers (2020-02-10T13:10:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.