Distributed Bayesian Matrix Decomposition for Big Data Mining and
Clustering
- URL: http://arxiv.org/abs/2002.03703v1
- Date: Mon, 10 Feb 2020 13:10:53 GMT
- Title: Distributed Bayesian Matrix Decomposition for Big Data Mining and
Clustering
- Authors: Chihao Zhang and Yang Yang and Wei Zhang and Shihua Zhang
- Abstract summary: We propose a distributed Bayesian matrix decomposition model for big data mining and clustering.
Specifically, we adopt three strategies to implement the distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference.
Our algorithms scale up well to big data and achieve superior or competitive performance compared to other distributed methods.
- Score: 13.491022200305824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matrix decomposition is one of the fundamental tools for discovering knowledge
from the big data generated by modern applications. However, it is still
inefficient or infeasible to process very big data with such a method on a
single machine. Moreover, big data are often collected and stored in a
distributed fashion across different machines, so such data generally bear strong
heterogeneous noise. It is therefore essential and useful to develop distributed
matrix decomposition for big data analytics. Such a method should scale up well,
model the heterogeneous noise, and address the communication issue in a distributed
system. To this end, we propose a distributed Bayesian matrix decomposition
model (DBMD) for big data mining and clustering. Specifically, we adopt three
strategies to implement the distributed computing: 1) accelerated
gradient descent, 2) the alternating direction method of multipliers (ADMM),
and 3) statistical inference. We investigate the theoretical convergence
behaviors of these algorithms. To address the heterogeneity of the noise, we
propose an optimal plug-in weighted average that reduces the variance of the
estimation. Synthetic experiments validate our theoretical results, and
real-world experiments show that our algorithms scale up well to big data and
achieve superior or competitive performance compared to other distributed
methods.
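To make the plug-in weighted average concrete, here is a minimal, illustrative sketch of inverse-variance weighting of local estimates from machines with heterogeneous noise. Everything in it is an assumption for exposition (a rank-truncated SVD stands in for the local Bayesian decomposition step, and the residual variance serves as the plug-in noise estimate); it is not the authors' implementation.

```python
# Illustrative sketch only: the weighting scheme below mimics the idea of
# combining local estimates from machines with heterogeneous noise via an
# (inverse-variance) optimal weighted average. All names are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth low-rank signal shared by all machines.
n, d, rank, n_machines = 200, 50, 5, 4
signal = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, d))

# Each machine observes the signal under its own noise level (heterogeneity).
noise_levels = np.array([0.1, 0.5, 1.0, 2.0])
local_data = [signal + s * rng.normal(size=(n, d)) for s in noise_levels]

def local_estimate(X, r):
    # A rank-truncated SVD stands in for the local decomposition step.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

estimates = [local_estimate(X, rank) for X in local_data]

# Plug-in estimate of each machine's noise variance from its residual.
variances = np.array([np.var(X - E) for X, E in zip(local_data, estimates)])

# Inverse-variance weights: lower-noise machines get more weight.
weights = (1.0 / variances) / np.sum(1.0 / variances)

combined = sum(w * E for w, E in zip(weights, estimates))
uniform = sum(estimates) / n_machines

print("uniform-average error :", np.linalg.norm(uniform - signal))
print("weighted-average error:", np.linalg.norm(combined - signal))
```

Because lower-noise machines receive larger weights, the weighted combination typically incurs a smaller error than the uniform average whenever the noise levels genuinely differ across machines, which is the variance-reduction effect the abstract describes.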
Related papers
- Spectral Clustering for Discrete Distributions [22.450518079181542]
Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods.
We show that spectral clustering combined with distribution affinity measures can be more accurate and efficient than barycenter methods.
We provide theoretical guarantees for the success of our methods in clustering distributions.
arXiv Detail & Related papers (2024-01-25T03:17:03Z)
- Distributed Linear Regression with Compositional Covariates [5.085889377571319]
We focus on the distributed sparse penalized linear log-contrast model in massive compositional data.
Two distributed optimization techniques are proposed for solving the two different constrained convex optimization problems.
In the decentralized topology, we introduce a distributed coordinate-wise descent algorithm for obtaining a communication-efficient regularized estimation.
arXiv Detail & Related papers (2023-10-21T11:09:37Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the explicit computation of large covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyze its convergence and computational complexity theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Information Theory Measures via Multidimensional Gaussianization [7.788961560607993]
Information theory is an outstanding framework to measure uncertainty, dependence and relevance in data and systems.
It has several desirable properties for real world applications.
However, obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality.
arXiv Detail & Related papers (2020-10-08T07:22:16Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model (see the consensus-ADMM sketch after this list).
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
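Both DBMD's second strategy and the coded stochastic ADMM paper above build on consensus ADMM, in which each node minimizes its local loss while a global variable enforces agreement. Below is a minimal sketch under assumed conditions (a quadratic local loss with a closed-form x-update, and illustrative variable names); it reproduces neither paper's actual algorithm.

```python
# Minimal consensus-ADMM sketch for distributed least squares; an
# illustration of the ADMM strategy discussed above. The quadratic local
# loss and all names are assumptions for exposition.
import numpy as np

rng = np.random.default_rng(1)
d, n_nodes, rho, n_iters = 10, 5, 1.0, 50

x_true = rng.normal(size=d)
# Each node holds a private data shard (A_i, b_i).
A = [rng.normal(size=(40, d)) for _ in range(n_nodes)]
b = [Ai @ x_true + 0.1 * rng.normal(size=40) for Ai in A]

z = np.zeros(d)                            # global consensus variable
u = [np.zeros(d) for _ in range(n_nodes)]  # scaled dual variables

for _ in range(n_iters):
    # Local x-updates (closed form for the quadratic loss), run in parallel.
    x = [np.linalg.solve(Ai.T @ Ai + rho * np.eye(d),
                         Ai.T @ bi + rho * (z - ui))
         for Ai, bi, ui in zip(A, b, u)]
    # Consensus z-update: only these node-level vectors cross the network.
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
    # Dual updates enforce agreement between local and global variables.
    u = [ui + xi - z for xi, ui in zip(x, u)]

print("consensus error:", np.linalg.norm(z - x_true))
```

Only the per-node vectors x_i + u_i are communicated at each round, which is why consensus ADMM is attractive when communication, rather than local computation, is the bottleneck.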
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.