Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
- URL: http://arxiv.org/abs/2501.12299v1
- Date: Tue, 21 Jan 2025 17:11:25 GMT
- Title: Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
- Authors: Sebastian Salwig, Till Kahlke, Florian Hirschberger, Dennis Forster, Jörg Lücke
- Abstract summary: Our proposed algorithm significantly reduces runtime complexity per iteration from $\mathcal{O}(NCD^2)$ to a complexity scaling linearly with $D$ and remaining constant w.r.t. $C$.
As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
- Score: 5.429282997550318
- License:
- Abstract: Gaussian Mixture Models (GMMs) range among the most frequently used machine learning models. However, training large, general GMMs becomes computationally prohibitive for datasets with many data points $N$ of high-dimensionality $D$. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is integrated with mixtures of factor analyzers (MFAs). For GMMs with $C$ components, our proposed algorithm significantly reduces runtime complexity per iteration from $\mathcal{O}(NCD^2)$ to a complexity scaling linearly with $D$ and remaining constant w.r.t. $C$. Numerical validation of this theoretical complexity reduction then shows the following: the distance evaluations required for the entire GMM optimization process scale sublinearly with $NC$. On large-scale benchmarks, this sublinearity results in speed-ups of an order-of-magnitude compared to the state-of-the-art. As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
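To make the complexity claim concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a single truncated variational E-step for a GMM with MFA-parameterized covariances $\Sigma_c = W_c W_c^\top + \mathrm{diag}(\psi_c)$, where $W_c \in \mathbb{R}^{D\times H}$ and $H \ll D$. The function names, the NumPy realization, and the assumption that each data point scores only a small candidate set of components are illustrative choices; the point shown is that no $D \times D$ covariance is ever formed, so the per-point cost is linear in $D$ and independent of the total number of components $C$.

```python
import numpy as np

# Illustrative sketch (not the paper's code): one truncated variational E-step
# for a GMM whose covariances follow the mixture-of-factor-analyzers (MFA) form
#   Sigma_c = W_c W_c^T + diag(psi_c),   W_c of shape (D, H),   H << D.
# Each data point scores only a small candidate set of components, so the
# per-point cost is O(|candidates| * (H^2 D + H^3)): linear in D and
# independent of the total number of components C.

def mfa_log_joint(x, log_pi, mus, Ws, psis, candidates):
    """log p(x, c) for the candidate components only, via Woodbury / det lemma."""
    D = x.shape[0]
    scores = np.empty(len(candidates))
    for i, c in enumerate(candidates):
        W, psi, mu = Ws[c], psis[c], mus[c]
        d = x - mu
        inv_psi_d = d / psi                           # diag(psi)^{-1} d
        A = np.eye(W.shape[1]) + (W.T / psi) @ W      # H x H capacitance matrix
        u = W.T @ inv_psi_d
        maha = d @ inv_psi_d - u @ np.linalg.solve(A, u)       # d^T Sigma^{-1} d
        logdet = np.log(psi).sum() + np.linalg.slogdet(A)[1]   # log |Sigma_c|
        scores[i] = log_pi[c] - 0.5 * (maha + logdet + D * np.log(2 * np.pi))
    return scores

def truncated_responsibilities(x, log_pi, mus, Ws, psis, candidates):
    """Variational posterior q(c | x) restricted to the candidate set."""
    s = mfa_log_joint(x, log_pi, mus, Ws, psis, candidates)
    s -= s.max()                                      # numerical stability
    r = np.exp(s)
    return r / r.sum()

# Toy usage with random parameters (shapes only, not trained values):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, C = 64, 4, 1000
    mus = rng.normal(size=(C, D))
    Ws = rng.normal(size=(C, D, H)) * 0.1
    psis = np.ones((C, D))
    log_pi = np.full(C, -np.log(C))
    x = rng.normal(size=D)
    cand = [3, 17, 256, 980]          # e.g. found by a cheap nearest-mean search
    print(truncated_responsibilities(x, log_pi, mus, Ws, psis, cand))
```

The Woodbury identity and the matrix determinant lemma keep each candidate evaluation at $\mathcal{O}(H^2 D)$; how the candidate sets are constructed, and why the total number of distance evaluations then scales sublinearly with $NC$, is the part of the paper's contribution not shown in this sketch.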
Related papers
- Stochastic First-Order Learning for Large-Scale Flexibly Tied Gaussian Mixture Model [3.4546761246181696]
We propose a new optimization algorithm on the manifold of Gaussian Mixture Models (GMMs)
We observe that these methods can outperform the expectation-maximization algorithm in terms of attaining better likelihood, needing fewer epochs for convergence, and consuming less time per epoch.
arXiv Detail & Related papers (2022-12-11T04:24:52Z)
- Sparse Kernel Gaussian Processes through Iterative Charted Refinement (ICR) [0.0]
We present a new, generative method named Iterative Charted Refinement (ICR) to model Gaussian Processes.
ICR represents long- and short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart.
ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU.
arXiv Detail & Related papers (2022-06-21T18:00:01Z)
- Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit.
Compared to the state-of-the-art scheme, the proposed scheme has a drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z)
- Tensor Moments of Gaussian Mixture Models: Theory and Applications [1.6114012813668934]
We develop theory and numerical methods for implicit computations with moment tensors of GMMs.
We show it is possible to debias the data observations, in which case the problem of unknown means reduces to symmetric tensor decomposition.
arXiv Detail & Related papers (2022-02-14T18:42:11Z)
- Robust Model Selection and Nearly-Proper Learning for GMMs [26.388358539260473]
In learning theory, a standard assumption is that the data is generated from a finite mixture model. But what happens when the number of components is not known in advance?
We are able to approximately determine the minimum number of components needed to fit the distribution within a logarithmic factor.
arXiv Detail & Related papers (2021-06-05T01:58:40Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization [110.52708815647613]
In this paper, we propose a faster stochastic alternating direction method of multipliers (ADMM) for nonconvex optimization based on a new stochastic path-integrated differential estimator, called SPIDER-ADMM.
We prove that SPIDER-ADMM achieves a record-breaking incremental first-order oracle (IFO) complexity for finding an approximate stationary point.
Our theoretical analysis further shows that the online SPIDER-ADMM improves the IFO complexity of the existing best results by a factor of $\mathcal{O}(n^{1/2})$.
arXiv Detail & Related papers (2020-08-04T02:59:42Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^{2})$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
- Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters [152.03852111442114]
Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have a high computational complexity.
Our paper shows that such complexity can significantly affect the overall convergence performance of ANIL, the almost-no-inner-loop variant of MAML that adapts only partial parameters.
arXiv Detail & Related papers (2020-06-16T19:57:48Z)
- FANOK: Knockoffs in Linear Time [73.5154025911318]
We describe a series of algorithms that efficiently implement Gaussian model-X knockoffs to control the false discovery rate on large scale feature selection problems.
We test our methods on problems with $p$ as large as $500,000$.
arXiv Detail & Related papers (2020-06-15T21:55:34Z)
- Learning Gaussian Graphical Models via Multiplicative Weights [54.252053139374205]
We adapt an algorithm of Klivans and Meka based on the method of multiplicative weight updates.
The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature.
It has a low runtime of $O(mp^2)$ for $m$ samples and $p$ nodes, and can trivially be implemented in an online manner (a rough multiplicative-weights sketch follows after this list).
arXiv Detail & Related papers (2020-02-20T10:50:58Z)
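As a rough illustration of the multiplicative-weights idea behind the last entry above, the following is a hedged sketch of a Hedge-style (exponentiated-gradient) update for regressing one variable on the remaining ones, the node-wise view commonly used for Gaussian graphical model learning. The loss definition, step size, and signed-feature construction are illustrative assumptions, not the Klivans-Meka algorithm adapted in that paper.

```python
import numpy as np

# Illustrative sketch only: a Hedge-style multiplicative-weights (exponentiated
# gradient) update for regressing one variable on the others -- the node-wise
# view used when learning Gaussian graphical models. The loss, step size, and
# signed-feature trick are assumptions for illustration.

def mw_neighborhood_sketch(X, node, lam=2.0, eta=0.1, n_rounds=500, seed=0):
    """Return heuristic regression weights of variable `node` on the rest."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    y = X[:, node]
    Z = np.delete(X, node, axis=1)            # predictors: the other variables
    F = np.hstack([Z, -Z])                    # signed copies so signs can flip
    scale = np.abs(F).max() + 1e-12
    F = F / scale                             # keep per-round losses bounded
    w = np.ones(F.shape[1])                   # multiplicative weights

    for _ in range(n_rounds):
        i = rng.integers(m)                   # one random sample per round
        prob = w / w.sum()
        pred = lam * F[i] @ prob              # prediction of current mixture
        grad = np.clip((pred - y[i]) * F[i], -1.0, 1.0)   # per-feature signal
        w *= np.exp(-eta * grad)              # exponentiated-gradient update

    coef = lam * (w / w.sum())
    return (coef[: p - 1] - coef[p - 1:]) / scale   # recombine signed copies
```

The parameter `lam` (a hypothetical name here) bounds the $\ell_1$ norm of the recovered coefficients; thresholding the returned weights would give a candidate neighborhood for the chosen node.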
This list is automatically generated from the titles and abstracts of the papers on this site.