Distributed Bayesian Matrix Decomposition for Big Data Mining and
Clustering
- URL: http://arxiv.org/abs/2002.03703v1
- Date: Mon, 10 Feb 2020 13:10:53 GMT
- Title: Distributed Bayesian Matrix Decomposition for Big Data Mining and
Clustering
- Authors: Chihao Zhang and Yang Yang and Wei Zhang and Shihua Zhang
- Abstract summary: We propose a distributed Bayesian matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale up well to big data and achieve superior or competitive performance compared to other distributed methods.
- Score: 13.491022200305824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matrix decomposition is one of the fundamental tools for discovering knowledge
from the big data generated by modern applications. However, it is still
inefficient or infeasible to process very big data with such a method on a
single machine. Moreover, big data are often collected and stored in a
distributed fashion across different machines, so such data generally bear strong
heterogeneous noise. It is therefore essential and useful to develop distributed
matrix decomposition for big data analytics. Such a method should scale up well,
model the heterogeneous noise, and address the communication issue in a distributed
system. To this end, we propose a distributed Bayesian matrix decomposition
model (DBMD) for big data mining and clustering. Specifically, we adopt three
strategies to implement the distributed computing: 1) accelerated
gradient descent, 2) the alternating direction method of multipliers (ADMM),
and 3) statistical inference. We investigate the theoretical convergence
behaviors of these algorithms. To address the heterogeneity of the noise, we
propose an optimal plug-in weighted average that reduces the variance of the
estimation. Synthetic experiments validate our theoretical results, and
real-world experiments show that our algorithms scale up well to big data and
achieve superior or competitive performance compared to other distributed
methods.
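To make the plug-in weighted average concrete, here is a minimal, illustrative sketch of inverse-variance weighting of local estimates from machines with heterogeneous noise. Everything in it is an assumption for exposition (a rank-truncated SVD stands in for the local Bayesian decomposition step, and the residual variance serves as the plug-in noise estimate); it is not the authors' implementation.

```python
# Illustrative sketch only: the weighting scheme below mimics the idea of
# combining local estimates from machines with heterogeneous noise via an
# (inverse-variance) optimal weighted average. All names are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth low-rank signal shared by all machines.
n, d, rank, n_machines = 200, 50, 5, 4
signal = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, d))

# Each machine observes the signal under its own noise level (heterogeneity).
noise_levels = np.array([0.1, 0.5, 1.0, 2.0])
local_data = [signal + s * rng.normal(size=(n, d)) for s in noise_levels]

def local_estimate(X, r):
    # A rank-truncated SVD stands in for the local decomposition step.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

estimates = [local_estimate(X, rank) for X in local_data]

# Plug-in estimate of each machine's noise variance from its residual.
variances = np.array([np.var(X - E) for X, E in zip(local_data, estimates)])

# Inverse-variance weights: lower-noise machines get more weight.
weights = (1.0 / variances) / np.sum(1.0 / variances)

combined = sum(w * E for w, E in zip(weights, estimates))
uniform = sum(estimates) / n_machines

print("uniform-average error :", np.linalg.norm(uniform - signal))
print("weighted-average error:", np.linalg.norm(combined - signal))
```

Because lower-noise machines receive larger weights, the weighted combination typically incurs a smaller error than the uniform average whenever the noise levels genuinely differ across machines, which is the variance-reduction effect the abstract describes.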
Related papers
- Spectral Clustering for Discrete Distributions [22.450518079181542]
Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods.
We show that spectral clustering combined with distribution affinity measures can be more accurate and efficient than barycenter methods.
We provide theoretical guarantees for the success of our methods in clustering distributions.
arXiv Detail & Related papers (2024-01-25T03:17:03Z)
- Distributed Linear Regression with Compositional Covariates [5.085889377571319]
We focus on the distributed sparse penalized linear log-contrast model in massive compositional data.
Two distributed optimization techniques are proposed for solving the two different constrained convex optimization problems.
In the decentralized topology, we introduce a distributed coordinate-wise descent algorithm for obtaining a communication-efficient regularized estimation.
arXiv Detail & Related papers (2023-10-21T11:09:37Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the explicit computation of large covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyze its convergence and computational complexity theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Information Theory Measures via Multidimensional Gaussianization [7.788961560607993]
Information theory is an outstanding framework to measure uncertainty, dependence and relevance in data and systems.
It has several desirable properties for real world applications.
However, obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality.
arXiv Detail & Related papers (2020-10-08T07:22:16Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model (see the consensus-ADMM sketch after this list).
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
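Both DBMD's second strategy and the coded stochastic ADMM paper above build on consensus ADMM, in which each node minimizes its local loss while a global variable enforces agreement. Below is a minimal sketch under assumed conditions (a quadratic local loss with a closed-form x-update, and illustrative variable names); it reproduces neither paper's actual algorithm.

```python
# Minimal consensus-ADMM sketch for distributed least squares; an
# illustration of the ADMM strategy discussed above. The quadratic local
# loss and all names are assumptions for exposition.
import numpy as np

rng = np.random.default_rng(1)
d, n_nodes, rho, n_iters = 10, 5, 1.0, 50

x_true = rng.normal(size=d)
# Each node holds a private data shard (A_i, b_i).
A = [rng.normal(size=(40, d)) for _ in range(n_nodes)]
b = [Ai @ x_true + 0.1 * rng.normal(size=40) for Ai in A]

z = np.zeros(d)                            # global consensus variable
u = [np.zeros(d) for _ in range(n_nodes)]  # scaled dual variables

for _ in range(n_iters):
    # Local x-updates (closed form for the quadratic loss), run in parallel.
    x = [np.linalg.solve(Ai.T @ Ai + rho * np.eye(d),
                         Ai.T @ bi + rho * (z - ui))
         for Ai, bi, ui in zip(A, b, u)]
    # Consensus z-update: only these node-level vectors cross the network.
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
    # Dual updates enforce agreement between local and global variables.
    u = [ui + xi - z for xi, ui in zip(x, u)]

print("consensus error:", np.linalg.norm(z - x_true))
```

Only the per-node vectors x_i + u_i are communicated at each round, which is why consensus ADMM is attractive when communication, rather than local computation, is the bottleneck.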
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.