DS-FACTO: Doubly Separable Factorization Machines
- URL: http://arxiv.org/abs/2004.13940v1
- Date: Wed, 29 Apr 2020 03:36:28 GMT
- Title: DS-FACTO: Doubly Separable Factorization Machines
- Authors: Parameswaran Raman, S.V.N. Vishwanathan
- Abstract summary: Factorization Machines (FM) are a powerful class of models that incorporate higher-order interactions among features to add more expressive power to linear models.
Despite using a low-rank representation for the pairwise features, the memory overheads of using factorization machines on large-scale real-world datasets can be prohibitively high.
Traditional algorithms for FM that work on a single machine are not equipped to handle this scale; a distributed algorithm that parallelizes the computation across a cluster is therefore inevitable.
- Score: 4.281959480566438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factorization Machines (FM) are a powerful class of models that
incorporate higher-order interactions among features to add more expressive power to linear
models. They have been used successfully in several real-world tasks such as
click prediction, ranking, and recommender systems. Despite using a low-rank
representation for the pairwise features, the memory overheads of using
factorization machines on large-scale real-world datasets can be prohibitively
high. For instance, on the Criteo Tera dataset, assuming a modest
$128$-dimensional latent representation and $10^{9}$ features, the memory
requirement for the model alone is on the order of $1$ TB. In addition, the
data itself occupies $2.1$ TB. Traditional algorithms for FM, which run on a
single machine, are not equipped to handle this scale, so a distributed
algorithm that parallelizes the computation across a cluster is unavoidable.
In this work, we propose DS-FACTO, a hybrid-parallel stochastic optimization
algorithm that partitions both the data and the model parameters
simultaneously. Our solution is fully decentralized and does not require the
use of any parameter servers. We present empirical results analyzing the
convergence behavior, predictive power, and scalability of DS-FACTO.
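To make the abstract's arithmetic concrete, here is a back-of-the-envelope estimate in Python (the dimensions come from the abstract; the 8-bytes-per-parameter assumption and variable names are ours), together with the standard second-order FM scoring identity that makes the low-rank pairwise term cheap to evaluate:

```python
import numpy as np

# Back-of-the-envelope model size for an FM at Criteo-Tera scale
# (dimensions from the abstract above; 64-bit floats assumed).
num_features = 10**9      # d
latent_dim = 128          # k
bytes_per_param = 8

latent_bytes = num_features * latent_dim * bytes_per_param  # d x k latent matrix V
linear_bytes = num_features * bytes_per_param               # linear term w
print(f"model ~ {(latent_bytes + linear_bytes) / 1e12:.2f} TB")  # ~1.03 TB

def fm_score(x, w0, w, V):
    """Second-order FM prediction via the standard O(d*k) identity:
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * (||x V||^2 - sum_i x_i^2 ||v_i||^2)."""
    xv = x @ V                                            # shape (k,)
    pairwise = 0.5 * (xv @ xv - ((x**2) @ (V**2)).sum())
    return w0 + w @ x + pairwise

# Tiny sanity check with random parameters.
rng = np.random.default_rng(0)
d, k = 20, 4
print(fm_score(rng.normal(size=d), 0.1, rng.normal(size=d),
               rng.normal(size=(d, k))))
```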
Related papers
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
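As a rough illustration of component (i), here is a minimal NumPy sketch of the two-stage pattern: score all rows with a cheap low-precision proxy, shortlist with some overshoot for high recall, then run the exact computation only on the shortlist. The float16 proxy and the overshoot factor are illustrative stand-ins, not HiRE's actual compression scheme:

```python
import numpy as np

def approx_topk(h, W, k, overshoot=4):
    """Two-stage top-k: cheap approximate scores -> high-recall shortlist,
    then exact scores restricted to the shortlisted rows."""
    proxy = W.astype(np.float16) @ h.astype(np.float16)   # cheap stand-in scores
    m = overshoot * k
    shortlist = np.argpartition(proxy, -m)[-m:]           # candidate rows, unordered
    exact = W[shortlist] @ h                              # full precision, subset only
    order = np.argsort(exact)[-k:][::-1]                  # exact top-k in the shortlist
    return shortlist[order], exact[order]

rng = np.random.default_rng(0)
W, h = rng.normal(size=(50_000, 256)), rng.normal(size=256)
idx, scores = approx_topk(h, W, k=10)
print(idx, scores)
```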
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
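For context, the hard-margin SVM referenced above is the classic problem below (standard textbook formulation; the paper's token-separation variant adapts it to attention weights):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert_2^2
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n.
```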
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization [14.732408788010313]
ML applications increasingly rely on complex deep learning models and large datasets.
To scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model.
With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems.
We show that our approach significantly enhances the robustness of state-of-the-art Byzantine-resilient aggregators.
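For context on what Byzantine-resilient aggregation means, here is a coordinate-wise-median baseline in Python; this is a classic robust aggregator used as a point of comparison, not the Flag Aggregator method itself:

```python
import numpy as np

def coordinate_median_aggregate(updates):
    """Robustly aggregate worker updates by taking the per-coordinate median.

    updates: array of shape (num_workers, num_params). A per-coordinate
    median tolerates a minority of corrupted (Byzantine) workers, whereas
    a plain mean can be skewed arbitrarily by a single bad worker.
    """
    return np.median(updates, axis=0)

# Example: nine honest workers plus one Byzantine worker sending garbage.
rng = np.random.default_rng(0)
honest = rng.normal(size=(9, 4))
byzantine = np.full((1, 4), 1e6)
updates = np.vstack([honest, byzantine])
print(coordinate_median_aggregate(updates))  # stays near the honest updates
print(updates.mean(axis=0))                  # dominated by the bad worker
```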
arXiv Detail & Related papers (2023-02-12T06:38:30Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that addresses these problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- FeDXL: Provable Federated Learning for Deep X-Risk Optimization [105.17383135458897]
We tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing algorithms are applicable.
The challenges in designing an FL algorithm for X-risks lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines.
arXiv Detail & Related papers (2022-10-26T00:23:36Z)
- Towards Robust and Automatic Hyper-Parameter Tunning [39.04604349338802]
We introduce a new class of HPO methods and explore how the low-rank factorization of the intermediate layers of a convolutional network can be used to define an analytical response surface.
We quantify how this surface behaves as a surrogate for model performance and show that it can be optimized using a trust-region search algorithm, which we call autoHyper.
arXiv Detail & Related papers (2021-11-28T05:27:34Z)
- Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
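A toy sketch of the mixed-likelihood idea: a Gaussian and a multinomial feature scored under one joint log-likelihood (the paper's generative/factorization structure is not reproduced; names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

def mixed_loglik(x_real, x_cat, mu, sigma, cat_probs):
    """Joint log-likelihood of one sample with a real-valued (Gaussian)
    feature and a categorical (multinomial, one draw) feature, assuming
    conditional independence given the parameters."""
    return norm.logpdf(x_real, loc=mu, scale=sigma) + np.log(cat_probs[x_cat])

print(mixed_loglik(0.3, 2, mu=0.0, sigma=1.0,
                   cat_probs=np.array([0.2, 0.3, 0.5])))
```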
arXiv Detail & Related papers (2021-08-27T18:10:31Z)
- Memory-Efficient Factorization Machines via Binarizing both Data and Model Coefficients [9.692334398809457]
Subspace Encoding Factorization Machine (SEFM) has been proposed to overcome the expressiveness limitation of Factorization Machines (FM).
We propose a new method called Binarized FM, which constrains the model parameters to be binary values.
Our proposed method achieves comparable accuracy with SEFM but with much less memory cost.
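A rough sketch of where the memory saving comes from: binary parameters need one bit per latent entry instead of a 64-bit float. The sign-based binarization below is a common generic scheme, not necessarily the paper's exact training procedure:

```python
import numpy as np

# Memory comparison for the latent matrix V of an FM (sizes illustrative).
d, k = 10**6, 32
dense_mb = d * k * 8 / 1e6     # float64 entries
binary_mb = d * k / 8 / 1e6    # one bit per entry after binarization
print(f"dense: {dense_mb:.0f} MB, binary: {binary_mb:.0f} MB "
      f"({dense_mb / binary_mb:.0f}x smaller)")

# Sign-based binarization of an existing latent matrix (a generic scheme;
# the paper's exact constraint/training procedure may differ).
V = np.random.randn(1000, 32)
V_bin = np.sign(V)                        # entries in {-1, +1}
packed = np.packbits(V_bin > 0, axis=1)   # bit-packed storage of the signs
```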
arXiv Detail & Related papers (2021-08-17T03:30:52Z)
- Memory and Computation-Efficient Kernel SVM via Binary Embedding and Ternary Model Coefficients [18.52747917850984]
Kernel approximation is widely used to scale up kernel SVM training and prediction.
Memory and computation costs of kernel approximation models are still too high if we want to deploy them on memory-limited devices.
We propose a novel memory and computation-efficient kernel SVM model by using both binary embedding and binary model coefficients.
arXiv Detail & Related papers (2020-10-06T09:41:54Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
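For reference, these are the standard (scaled-form) consensus-ADMM iterations that distributed schemes like this build on; the paper's coded, mini-batch stochastic variant modifies these textbook updates:

```latex
\begin{aligned}
x_i^{t+1} &= \operatorname*{arg\,min}_{x_i}\; f_i(x_i)
             + \tfrac{\rho}{2}\,\lVert x_i - z^{t} + u_i^{t} \rVert_2^2
             && \text{(local update at node } i\text{)}\\
z^{t+1}   &= \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{t+1} + u_i^{t}\bigr)
             && \text{(consensus/averaging step)}\\
u_i^{t+1} &= u_i^{t} + x_i^{t+1} - z^{t+1}
             && \text{(dual update)}
\end{aligned}
```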
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data; a simplified alignment sketch follows this entry.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
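The simplified sketch promised above: align one layer's neurons across two models by solving an assignment problem (optimal transport with uniform marginals and a hard coupling is exactly an assignment), then average the aligned weights. The paper uses soft OT couplings; function and variable names here are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_layer(W_a, W_b):
    """Fuse one weight matrix (neurons x inputs) from two models.

    Neurons of W_b are permuted to best match those of W_a by solving an
    assignment problem (OT with uniform marginals and a hard coupling),
    then the aligned weights are averaged. Full layer-wise fusion must
    also permute the next layer's incoming weights accordingly.
    """
    cost = -W_a @ W_b.T                    # negative neuron-to-neuron similarity
    _, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return 0.5 * (W_a + W_b[cols])

rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 16))
W_b = W_a[rng.permutation(8)] + 0.01 * rng.normal(size=(8, 16))
print(np.allclose(fuse_layer(W_a, W_b), W_a, atol=0.1))  # alignment undoes the shuffle
```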