CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures
for Large-scale Analysis
- URL: http://arxiv.org/abs/2204.08988v1
- Date: Tue, 19 Apr 2022 16:35:44 GMT
- Title: CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures
for Large-scale Analysis
- Authors: Or Dinari, Raz Zamir, John W. Fisher III, Oren Freifeld
- Abstract summary: The Dirichlet Process Mixture Model (DPMM) is a principled approach for adapting the complexity of the model to the data.
Despite their potential and mathematical elegance, DPMMs have yet to become a mainstream tool widely adopted by practitioners.
We propose a new, easy-to-use statistical software package for scalable DPMM inference.
- Score: 11.071895608242675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the realm of unsupervised learning, Bayesian nonparametric mixture models,
exemplified by the Dirichlet Process Mixture Model (DPMM), provide a principled
approach for adapting the complexity of the model to the data. Such models are
particularly useful in clustering tasks where the number of clusters is
unknown. Despite their potential and mathematical elegance, however, DPMMs have
yet to become a mainstream tool widely adopted by practitioners. This is
arguably due to the misconception that these models scale poorly, as well as to the
lack of high-performance (and user-friendly) software tools that can handle
large datasets efficiently. In this paper we bridge this practical gap by
proposing a new, easy-to-use statistical software package for scalable DPMM
inference. More concretely, we provide efficient and easily modifiable
implementations for high-performance distributed sampling-based inference in
DPMMs, where the user is free to choose between a multiple-machine, multiple-core
CPU implementation (written in Julia) and a multiple-stream GPU implementation
(written in CUDA/C++). Both the CPU and GPU implementations come with a common
(and optional) Python wrapper, providing the user with a single
point of entry with the same interface. On the algorithmic side, our
implementations leverage a leading DPMM sampler from Chang and Fisher III (2013).
While Chang and Fisher III's implementation (written in MATLAB/C++) used only the
CPU and was designed for a single multi-core machine, the packages we propose here
distribute the computations efficiently across either multiple multi-core machines
or multiple GPU streams. This leads to speedups, alleviates memory and storage
limitations, and lets us fit DPMMs to significantly larger and higher-dimensional
datasets than was previously possible with either Chang and Fisher III (2013) or
other DPMM methods.
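To make the kind of sampling-based DPMM inference described above concrete, here is a deliberately simple, serial stand-in: a collapsed Gibbs sampler for a one-dimensional DP mixture of Gaussians with known variance. It is an illustrative toy only, not the Chang and Fisher III (2013) sampler and not the package's parallel CPU/GPU implementation; the priors and function names below are assumptions chosen for brevity.

```python
import numpy as np

def dpmm_collapsed_gibbs(x, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=10.0,
                         n_iters=100, rng=None):
    """Collapsed Gibbs for a 1-D DP mixture of N(mu_k, sigma2) components,
    with a conjugate N(mu0, tau2) prior on each cluster mean."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    z = np.zeros(n, dtype=int)                    # start with one cluster
    for _ in range(n_iters):
        for i in range(n):
            z[i] = -1                             # remove point i from its cluster
            labels, counts = np.unique(z[z >= 0], return_counts=True)
            log_p = []
            for k, n_k in zip(labels, counts):    # existing clusters
                s = x[z == k].sum()
                post_var = 1.0 / (1.0 / tau2 + n_k / sigma2)
                post_mean = post_var * (mu0 / tau2 + s / sigma2)
                pred_var = post_var + sigma2      # posterior predictive variance
                log_p.append(np.log(n_k)
                             - 0.5 * np.log(2 * np.pi * pred_var)
                             - 0.5 * (x[i] - post_mean) ** 2 / pred_var)
            pred_var = tau2 + sigma2              # probability of a brand-new cluster
            log_p.append(np.log(alpha)
                         - 0.5 * np.log(2 * np.pi * pred_var)
                         - 0.5 * (x[i] - mu0) ** 2 / pred_var)
            p = np.exp(np.array(log_p) - max(log_p))
            p /= p.sum()
            choice = rng.choice(len(p), p=p)
            z[i] = labels[choice] if choice < len(labels) else z.max() + 1
    return z

# Toy usage: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
print("clusters found:", len(np.unique(dpmm_collapsed_gibbs(x, n_iters=20, rng=rng))))
```

The per-point serial loop above illustrates why naive DPMM inference scales poorly; closing that gap is what the distributed CPU and multi-stream GPU implementations target.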
Related papers
- Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity of Gaussian process models and can be computed with embarrassingly parallel algorithms.
While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units (GPUs) is lacking.
We show that our new method outperforms the other two and then present it in the GpGpU R package.
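For context, the Vecchia approximation replaces a Gaussian process's joint likelihood with a product of low-dimensional conditionals, each conditioning only on a few previously ordered neighbors, which is what makes it embarrassingly parallel. The sketch below is a minimal serial NumPy version for 1-D inputs with a squared-exponential kernel; it illustrates the general technique only and is not code from the GpGpU package.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    d = np.abs(a[:, None] - b[None, :])
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def vecchia_loglik(x, y, m=5):
    """Approximate log p(y) for a zero-mean GP by conditioning each point
    on at most m previously ordered neighbors instead of on all of them."""
    order = np.argsort(x)                        # simple left-to-right ordering
    x, y = x[order], y[order]
    ll = 0.0
    for i in range(len(x)):
        nbrs = np.arange(max(0, i - m), i)       # up to m nearest predecessors
        k_ii = sq_exp_kernel(x[i:i+1], x[i:i+1])[0, 0]
        if len(nbrs) == 0:
            mean, var = 0.0, k_ii
        else:
            K_nn = sq_exp_kernel(x[nbrs], x[nbrs])
            k_in = sq_exp_kernel(x[i:i+1], x[nbrs])[0]
            w = np.linalg.solve(K_nn + 1e-8 * np.eye(len(nbrs)), k_in)
            mean = w @ y[nbrs]                   # conditional Gaussian mean
            var = k_ii - w @ k_in                # conditional Gaussian variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll
```

Each term of the sum depends only on a small neighbor set, so the loop can be distributed across cores or GPU threads, which is the angle the paper pursues.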
arXiv Detail & Related papers (2024-07-03T01:24:44Z) - Distributed Collapsed Gibbs Sampler for Dirichlet Process Mixture Models
in Federated Learning [0.22499166814992444]
This paper proposes a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs (DisCGS) using sufficient statistics.
Our approach uses the collapsed Gibbs sampler and is specifically designed to work on distributed data across independent and heterogeneous machines.
For instance, with a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations while our approach achieves the same number of iterations in just 3 minutes.
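The protocol itself is not spelled out in this summary, but the sufficient-statistics trick it relies on is simple: each machine summarizes its shard of the data per cluster, and the summaries just add up. A minimal sketch, with hypothetical helper names and a 1-D Gaussian model for brevity (not the DisCGS implementation):

```python
import numpy as np

def local_stats(x_shard, z_shard, n_clusters):
    """Per-cluster (count, sum, sum of squares) for one worker's shard --
    everything a conjugate Gaussian update needs to know about the shard."""
    stats = np.zeros((n_clusters, 3))
    for k in range(n_clusters):
        xk = x_shard[z_shard == k]
        stats[k] = [len(xk), xk.sum(), (xk ** 2).sum()]
    return stats

def merge_stats(per_worker_stats):
    return np.sum(per_worker_stats, axis=0)      # sufficient statistics add up

# Toy usage with two "workers" holding shards of the same dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
z = rng.integers(0, 3, size=1000)
shards = [(x[:500], z[:500]), (x[500:], z[500:])]
merged = merge_stats([local_stats(xs, zs, 3) for xs, zs in shards])
print(merged[:, 0])                              # global per-cluster counts
```

Because only these small summaries cross the network, the communication cost does not grow with the number of data points.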
arXiv Detail & Related papers (2023-12-18T13:16:18Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
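As a rough illustration of the Dense-and-Sparse idea, the sketch below keeps a tiny fraction of large-magnitude weights exact in a sparse matrix and quantizes the rest to a small codebook. Uniform 3-bit levels stand in for SqueezeLLM's sensitivity-based non-uniform codebook, which the summary mentions but does not specify; this is not the paper's code.

```python
import numpy as np
from scipy import sparse

def dense_and_sparse(W, outlier_frac=0.005, bits=3):
    """Split W into a quantized dense part plus exact sparse outliers."""
    cutoff = np.quantile(np.abs(W), 1.0 - outlier_frac)
    outlier_mask = np.abs(W) >= cutoff
    S = sparse.csr_matrix(np.where(outlier_mask, W, 0.0))      # exact outliers
    dense = np.where(outlier_mask, 0.0, W)
    levels = np.linspace(dense.min(), dense.max(), 2 ** bits)  # toy uniform codebook
    idx = np.abs(dense[..., None] - levels).argmin(-1)         # nearest level
    return levels[idx], S

W = np.random.default_rng(0).normal(size=(64, 64))
W_q, S = dense_and_sparse(W)
print("mean reconstruction error:", np.abs(W - (W_q + S.toarray())).mean())
```

Storing the outliers separately is what lets the dense part drop to very low bit widths without the large-magnitude weights dominating the quantization error.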
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - ParaGraph: Weighted Graph Representation for Performance Optimization of
HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for multicore NPUs, called Tensor Slicing Optimization (TSO).
Experimental results show that TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for solving the large-scale multimedia retrieval problem.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) method to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Online Multi-Object Tracking and Segmentation with GMPHD Filter and
Mask-based Affinity Fusion [79.87371506464454]
We propose a fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input.
The proposed method is based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association (HDA), and a mask-based affinity fusion (MAF) model.
In experiments on two popular MOTS datasets, the key modules each show improvements.
arXiv Detail & Related papers (2020-08-31T21:06:22Z) - HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data
Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
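A minimal flavor of that NumPy-like API, to be run under MPI (e.g. mpirun -n 4 python demo.py). The split argument chooses the axis that is sharded across processes; the calls below follow HeAT's NumPy-style naming as best I recall, so treat the exact signatures as assumptions rather than a verified example.

```python
import heat as ht

# A distributed array: axis 0 is split across the MPI processes.
x = ht.arange(1_000_000, dtype=ht.float32, split=0)
y = x * 2.0 + 1.0          # element-wise ops run node-locally via PyTorch
print(y.mean())            # reductions communicate across processes via MPI
```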
arXiv Detail & Related papers (2020-07-27T13:33:17Z) - Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic
Circuits [99.59941892183454]
We propose Einsum Networks (EiNets), a novel implementation design for probabilistic circuits (PCs).
At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum-operation.
We show that the implementation of Expectation-Maximization (EM) can be simplified for PCs, by leveraging automatic differentiation.
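The core trick, in miniature: rather than looping over thousands of sum nodes, evaluate a whole layer of them with a single einsum. Toy shapes and plain NumPy below; EiNets apply this to entire probabilistic-circuit layers on the GPU.

```python
import numpy as np

batch, n_children, n_sum_nodes = 256, 32, 1000
child_vals = np.random.rand(batch, n_children)       # outputs of child nodes
weights = np.random.rand(n_sum_nodes, n_children)    # one mixture per sum node
weights /= weights.sum(axis=1, keepdims=True)

# out[b, k] = sum_c weights[k, c] * child_vals[b, c]  -- one fused operation
out = np.einsum('bc,kc->bk', child_vals, weights)
print(out.shape)                                      # (256, 1000)
```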
arXiv Detail & Related papers (2020-04-13T23:09:15Z) - Particle-Gibbs Sampling For Bayesian Feature Allocation Models [77.57285768500225]
Most widely used MCMC strategies rely on an element-wise Gibbs update of the feature allocation matrix.
We have developed a Gibbs sampler that can update an entire row of the feature allocation matrix in a single move.
This sampler is impractical for models with a large number of features as the computational complexity scales exponentially in the number of features.
We develop a Particle Gibbs sampler that targets the same distribution as the row-wise Gibbs updates, but has computational complexity that grows only linearly in the number of features.
arXiv Detail & Related papers (2020-01-25T22:11:51Z)
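To see why the exact row-wise update above is impractical, count the configurations: a row of a binary feature allocation matrix with K features has 2^K possible values, all of which an exact row-wise Gibbs move must score, whereas a particle Gibbs sweep only keeps a fixed number of candidate partial rows per feature. A purely illustrative count (not the paper's sampler):

```python
from itertools import product

K = 20                                   # number of latent features
n_particles = 100
print("exact row-wise move scores:", 2 ** K, "configurations")          # 1,048,576
print("particle Gibbs sweep scores:", n_particles * K, "partial rows")  # 2,000

# For tiny K the exhaustive enumeration is easy to write down:
print(list(product([0, 1], repeat=3)))
```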
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.