Training with Multi-Layer Embeddings for Model Reduction
- URL: http://arxiv.org/abs/2006.05623v1
- Date: Wed, 10 Jun 2020 02:47:40 GMT
- Title: Training with Multi-Layer Embeddings for Model Reduction
- Authors: Benjamin Ghaemmaghami, Zihao Deng, Benjamin Cho, Leo Orshansky, Ashish
Kumar Singh, Mattan Erez, and Michael Orshansky
- Abstract summary: We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing d by 4-8X, with a corresponding improvement in memory footprint, at a given model accuracy.
- Score: 0.9046327456472286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern recommendation systems rely on real-valued embeddings of categorical
features. Increasing the dimension of embedding vectors improves model accuracy
but comes at a high cost to model size. We introduce a multi-layer embedding
training (MLET) architecture that trains embeddings via a sequence of linear
layers to derive a superior trade-off between embedding accuracy and model size.
Our approach is fundamentally based on the ability of factorized linear layers
to produce embeddings superior to those of a single linear layer. We focus on
the analysis and implementation of a two-layer scheme. Harnessing recent
results on the dynamics of backpropagation in linear neural networks, we
explain the superiority of multi-layer embeddings via their tendency toward
solutions of lower effective rank. We show that substantial advantages are
obtained in the regime where the width of the hidden layer is much larger than
that of the final embedding (d). Crucially, at the conclusion of training, we
convert the two-layer solution into a single-layer one; as a result, the
inference-time model size scales as d.
We prototype the MLET scheme within Facebook's PyTorch-based open-source Deep
Learning Recommendation Model (DLRM). We show that it allows reducing d by
4-8X, with a corresponding improvement in memory footprint, at a given model
accuracy. The experiments are run on two publicly available click-through-rate
prediction benchmarks (Criteo-Kaggle and Avazu). MLET incurs an average runtime
overhead of 25%.
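As a concrete illustration of the two-layer scheme described in the abstract, the following PyTorch sketch factorizes an embedding into a wide inner table followed by a linear projection, and folds the pair into a single table after training. It is a minimal sketch under the abstract's description; the class and parameter names (TwoLayerEmbedding, k, fold) are illustrative assumptions, not taken from the MLET/DLRM code.

```python
# Minimal MLET-style sketch: a (V x k) lookup table followed by a (k -> d)
# linear map during training, collapsed into a single (V x d) table for inference.
import torch
import torch.nn as nn

class TwoLayerEmbedding(nn.Module):
    """Embedding factorized as a wide inner table followed by a linear projection."""
    def __init__(self, num_embeddings: int, d: int, k: int):
        super().__init__()
        assert k >= d, "inner width k should be at least the target dimension d"
        self.inner = nn.Embedding(num_embeddings, k)   # trained lookup table, V x k
        self.proj = nn.Linear(k, d, bias=False)        # trained projection, k -> d

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # During training, the effective embedding is the product of two linear maps.
        return self.proj(self.inner(idx))

    def fold(self) -> nn.Embedding:
        # After training, collapse the two layers into a single V x d table so the
        # inference-time model size scales with d, not with the inner width k.
        with torch.no_grad():
            folded = nn.Embedding(self.inner.num_embeddings, self.proj.out_features)
            folded.weight.copy_(self.inner.weight @ self.proj.weight.t())
        return folded

# Usage: train with the factorized module, then swap in the folded table for inference.
emb = TwoLayerEmbedding(num_embeddings=10_000, d=16, k=128)
ids = torch.randint(0, 10_000, (32,))
assert torch.allclose(emb(ids), emb.fold()(ids), atol=1e-5)
```

The fold step is what makes the inference-time model size scale with d rather than with the inner width k, matching the single-layer conversion described above.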
Related papers
- Starbucks: Improved Training for 2D Matryoshka Embeddings [32.44832240958393]
We propose Starbucks, a new training strategy for Matryoshka-like embedding models.
For the fine-tuning phase, we provide a fixed list of layer-dimension pairs, ranging from small to large sizes.
We also introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions.
arXiv Detail & Related papers (2024-10-17T05:33:50Z)
- Enhancing Cross-Category Learning in Recommendation Systems with Multi-Layer Embedding Training [2.4862527485819186]
Multi-layer embedding training (MLET) trains embeddings using a factorization of the embedding layer, with an inner dimension higher than the target embedding dimension.
MLET consistently produces better models, especially for rare items.
At constant model quality, MLET allows the embedding dimension, and hence the model size, to be reduced by up to 16x (5.8x on average).
arXiv Detail & Related papers (2023-09-27T09:32:10Z)
- Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers (a generic layer-wise averaging sketch appears after this list).
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found to be beneficial for pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- A block coordinate descent optimizer for classification problems exploiting convexity [0.0]
We introduce a coordinate descent method for training deep networks on classification tasks that exploits the convexity of the cross-entropy loss in the weights of the final linear layer.
By alternating between a second-order method that finds globally optimal parameters for the linear layer and gradient descent on the hidden layers, we ensure an optimal fit of the adaptive basis to the data throughout training.
arXiv Detail & Related papers (2020-06-17T19:49:06Z)
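As referenced in the Layer-wise Linear Mode Connectivity entry above, the sketch below shows generic layer-wise parameter averaging of two independently trained models. It is a minimal illustration only; the helper name and the prefix-based notion of a layer group are assumptions made here, not the authors' code.

```python
# Generic layer-wise parameter averaging: average the parameters of selected
# layers of two models with the same architecture, keep model A's weights elsewhere.
import torch

def average_layers(state_a: dict, state_b: dict, layers_to_average: set) -> dict:
    """Return a merged state dict; selected layers are averaged, others come from A."""
    merged = {}
    for name, param_a in state_a.items():
        param_b = state_b[name]
        # A layer "group" is identified here by the prefix before the first dot,
        # e.g. "fc1" in "fc1.weight"; averaging all prefixes recovers full-model averaging.
        if name.split(".")[0] in layers_to_average:
            merged[name] = 0.5 * (param_a + param_b)
        else:
            merged[name] = param_a.clone()
    return merged

# Usage with two independently trained copies of the same architecture:
# merged_state = average_layers(model_a.state_dict(), model_b.state_dict(), {"fc1"})
# model_c.load_state_dict(merged_state)
```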
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.