Concurrent Alternating Least Squares for multiple simultaneous Canonical
Polyadic Decompositions
- URL: http://arxiv.org/abs/2010.04678v2
- Date: Wed, 8 Sep 2021 11:27:58 GMT
- Title: Concurrent Alternating Least Squares for multiple simultaneous Canonical
Polyadic Decompositions
- Authors: Christos Psarras, Lars Karlsson, Rasmus Bro and Paolo Bientinesi
- Abstract summary: We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab.
We show how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity.
Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
- Score: 2.3513645401551333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a
variety of applications, such as chemometrics, signal processing, and machine
learning. A broadly used method for computing such decompositions relies on the
Alternating Least Squares (ALS) algorithm. When the number of components is
small, regardless of its implementation, ALS exhibits low arithmetic intensity,
which severely hinders its performance and makes GPU offloading ineffective. We
observe that, in practice, experts often have to compute multiple
decompositions of the same tensor, each with a small number of components
(typically fewer than 20), to ultimately find the best ones to use for the
application at hand. In this paper, we illustrate how multiple decompositions
of the same tensor can be fused together at the algorithmic level to increase
the arithmetic intensity. Therefore, it becomes possible to make efficient use
of GPUs for further speedups; at the same time, the technique is compatible with
many enhancements typically used in ALS, such as line search, extrapolation,
and non-negativity constraints. We introduce the Concurrent ALS algorithm and
library, which offers an interface to Matlab, and a mechanism to effectively
deal with the issue that decompositions complete at different times.
Experimental results on artificial and real datasets demonstrate a shorter time
to completion due to increased arithmetic intensity.
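To make the fusion concrete, below is a minimal NumPy sketch of the idea (an illustration under assumed names, not the CALS library or its Matlab interface): because the Khatri-Rao product acts column-wise, column-stacking the factor matrices of N independent rank-R models turns their N small MTTKRPs into one large matrix product, while only the cheap R x R normal-equation solves remain per model.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I x R) and V (J x R) -> (I*J) x R."""
    I, R = U.shape
    J, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(I * J, R)

def fused_als_sweep(X, A, B, C, R):
    """One ALS sweep over N concurrent rank-R CP decompositions of the same
    3-way tensor X. Each factor matrix has N*R columns; columns
    [n*R:(n+1)*R] belong to decomposition n (hypothetical toy interface)."""
    N = A.shape[1] // R
    factors = [A, B, C]
    for mode in range(3):
        others = [f for m, f in enumerate(factors) if m != mode]
        # The Khatri-Rao product of column-stacked factors equals the
        # column-stack of the per-decomposition Khatri-Rao products, so a
        # single large product computes all N MTTKRPs at once: this is the
        # fusion step that raises arithmetic intensity.
        KR = khatri_rao(others[0], others[1])  # (prod of other dims) x N*R
        Xm = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        M = Xm @ KR  # fused MTTKRP for all N decompositions
        # The small normal-equation solves stay independent per decomposition.
        for n in range(N):
            s = slice(n * R, (n + 1) * R)
            G = ((others[0][:, s].T @ others[0][:, s])
                 * (others[1][:, s].T @ others[1][:, s]))  # Hadamard of Grams
            factors[mode][:, s] = M[:, s] @ np.linalg.pinv(G)
    return factors
```

In the library, the fused product is the step that benefits from GPU offloading. The abstract's mechanism for decompositions finishing at different times plausibly amounts to retiring converged models from the stacked factors, but the sketch above does not model that.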
Related papers
- Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Decreasing the Computing Time of Bayesian Optimization using
Generalizable Memory Pruning [56.334116591082896]
Running BO on high-dimensional or massive data sets becomes intractable due to the time complexity of the surrogate model.
We show a wrapper of memory pruning and bounded optimization that can be used with any surrogate model and acquisition function.
All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z) - Faster Adaptive Momentum-Based Federated Methods for Distributed
Composition Optimization [14.579475552088692]
We propose a class of faster federated composition optimization algorithms (i.e., MFCGD and AdaMFCGD) to solve nonconvex distributed composition problems.
In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates.
arXiv Detail & Related papers (2022-11-03T15:17:04Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - A Communication-Efficient Distributed Gradient Clipping Algorithm for
Training Deep Neural Networks [11.461878019780597]
Gradient Descent might converge slowly in some deep neural networks.
It remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup (the basic clipping step is sketched after this list).
arXiv Detail & Related papers (2022-05-10T16:55:33Z) - Alternating Mahalanobis Distance Minimization for Stable and Accurate CP
Decomposition [4.847980206213335]
We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from the one used in previous work.
We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank.
We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors.
arXiv Detail & Related papers (2022-04-14T19:56:36Z) - Efficient GPU implementation of randomized SVD and its applications [17.71779625877989]
Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms.
Typical solutions for matrix decompositions have polynomial complexity, which significantly increases their computational cost and time.
We leverage efficient processing operations that can be run in parallel on modern Graphical Processing Units (GPUs) to reduce the computational burden of computing matrix decompositions (the standard randomized SVD building block is sketched after this list).
arXiv Detail & Related papers (2021-10-05T07:42:41Z) - AsySQN: Faster Vertical Federated Learning Algorithms with Better
Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL).
The proposed algorithms make descent steps scaled by approximate Hessian information, without explicitly calculating the inverse Hessian matrix.
We show that the adopted asynchronous computation can make better use of the computation resource.
arXiv Detail & Related papers (2021-09-26T07:56:10Z) - Poly-NL: Linear Complexity Non-local Layers with Polynomials [76.21832434001759]
We formulate novel fast NonLocal blocks, capable of reducing complexity from quadratic to linear with no loss in performance.
The proposed method, which we dub "Poly-NL", is competitive with state-of-the-art performance across image recognition, instance segmentation, and face detection tasks.
arXiv Detail & Related papers (2021-07-06T19:51:37Z) - Efficient Alternating Least Squares Algorithms for Low Multilinear Rank
Approximation of Tensors [6.308492837096872]
We propose a new class of truncated HOSVD algorithms based on alternating least squares (ALS) for efficiently computing the low multilinear rank approximation of tensors.
ALS-based approaches eliminate the redundant computation of singular vectors of intermediate matrices and are therefore free of data explosion.
Numerical experiments with large-scale tensors from both synthetic and real-world applications demonstrate that ALS-based methods can substantially reduce the total cost of the original algorithms and are highly scalable for parallel computing (the classical HOOI template they build on is sketched after this list).
arXiv Detail & Related papers (2020-04-06T11:58:04Z)
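As referenced in the gradient clipping entry above, here is a minimal NumPy sketch of standard global-norm gradient clipping (our own illustration of the basic single-machine step, not the paper's distributed algorithm):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint l2 norm does not
    exceed max_norm; the update direction is preserved."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```

The question studied in the paper is what happens when this step interacts with gradient averaging across machines, which the sketch does not capture.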
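The randomized SVD referenced above reduces a large decomposition to a few large matrix products plus one small exact SVD, which is why it maps well to GPUs. A minimal NumPy sketch of the standard Halko-style scheme (background material, not the paper's implementation; parameter names are our own):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Rank-k randomized SVD: sketch the range of A with a random matrix,
    then take an exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p = min(k + oversample, min(m, n))
    # Range finder: the heavy work is tall-skinny GEMMs, ideal for GPUs.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, p)))
    for _ in range(n_iter):  # power iterations sharpen slow spectral decay
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small p x n projection, then lift back to m x k.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k, :]
```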
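Finally, for the ALS-based low multilinear rank approximation above, the classical Higher-Order Orthogonal Iteration (HOOI) is the textbook ALS scheme for the truncated Tucker problem; the paper's algorithms refine this template, so treat the following 3-way NumPy sketch as background rather than their method:

```python
import numpy as np

def hooi(X, ranks, n_sweeps=5):
    """Low multilinear rank (truncated Tucker) approximation of a 3-way
    tensor X via HOOI: each factor is updated from the SVD of X contracted
    with the other two factors, so after initialization only small
    matrices are decomposed (assumes feasible ranks)."""
    def unfold(T, mode):
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def contract(T, mats, skip=None):
        # Multiply T by mats[m].T along every mode m except `skip`.
        for m, U in enumerate(mats):
            if m != skip:
                T = np.moveaxis(np.tensordot(T, U, axes=(m, 0)), -1, m)
        return T

    # Initialize with the truncated HOSVD (one SVD per mode unfolding).
    factors = [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    for _ in range(n_sweeps):            # ALS sweeps over the three modes
        for mode in range(3):
            Y = contract(X, factors, skip=mode)
            factors[mode] = np.linalg.svd(
                unfold(Y, mode), full_matrices=False)[0][:, :ranks[mode]]
    return contract(X, factors), factors  # (core tensor, factor matrices)
```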
This list is automatically generated from the titles and abstracts of the papers on this site.