Reliable Distributed Clustering with Redundant Data Assignment
- URL: http://arxiv.org/abs/2002.08892v1
- Date: Thu, 20 Feb 2020 17:44:37 GMT
- Title: Reliable Distributed Clustering with Redundant Data Assignment
- Authors: Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat
- Abstract summary: We present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines.
We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations.
- Score: 48.40574754136434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present distributed generalized clustering algorithms that
can handle large scale data across multiple machines in spite of straggling or
unreliable machines. We propose a novel data assignment scheme that enables us
to obtain global information about the entire data even when some machines fail
to respond with the results of the assigned local computations. The assignment
scheme leads to distributed algorithms with good approximation guarantees for a
variety of clustering and dimensionality reduction problems.
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up machine numbers.
We propose Common Om REOm, which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - Federated K-Means Clustering via Dual Decomposition-based Distributed
Optimization [0.0]
This paper shows how dual decomposition can be applied for distributed training of $ K $-means clustering problems.
The training can be performed in a distributed manner by splitting the data across different nodes and linking these nodes through consensus constraints.
arXiv Detail & Related papers (2023-07-25T05:34:50Z) - Hard Regularization to Prevent Deep Online Clustering Collapse without
Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z) - Generating Multidimensional Clusters With Support Lines [0.0]
We present Clugen, a modular procedure for synthetic data generation.
Cluken is open source, comprehensively unit tested and documented.
We demonstrate that Clugen is fit for use in the assessment of clustering algorithms.
arXiv Detail & Related papers (2023-01-24T22:08:24Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower boundally, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z) - ExClus: Explainable Clustering on Low-dimensional Data Representations [9.496898312608307]
Dimensionality reduction and clustering techniques are frequently used to analyze complex data sets, but their results are often not easy to interpret.
We consider how to support users in interpreting apparent cluster structure on scatter plots where the axes are not directly interpretable.
We propose a new method to compute an interpretable clustering automatically, where the explanation is in the original high-dimensional space and the clustering is coherent in the low-dimensional projection.
arXiv Detail & Related papers (2021-11-04T21:24:01Z) - Distributed Principal Subspace Analysis for Partitioned Big Data:
Algorithms, Analysis, and Implementation [9.730443503568804]
Principal Subspace Analysis (PSA) is one of the most popular approaches for dimensionality reduction in signal processing and machine learning.
centralized PSA solutions are fast becoming irrelevant in the modern era of big data.
This paper revisits the problem of distributed PSA under the general framework of an arbitrarily connected network of machines.
arXiv Detail & Related papers (2021-03-11T01:33:38Z) - A semi-supervised sparse K-Means algorithm [3.04585143845864]
An unsupervised sparse clustering method can be employed in order to detect the subgroup of features necessary for clustering.
A semi-supervised method can use the labelled data to create constraints and enhance the clustering solution.
We show that the algorithm maintains the high performance of other semi-supervised algorithms and in addition preserves the ability to identify informative from uninformative features.
arXiv Detail & Related papers (2020-03-16T02:05:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.