Enhancing Cross-Category Learning in Recommendation Systems with
Multi-Layer Embedding Training
- URL: http://arxiv.org/abs/2309.15881v1
- Date: Wed, 27 Sep 2023 09:32:10 GMT
- Title: Enhancing Cross-Category Learning in Recommendation Systems with
Multi-Layer Embedding Training
- Authors: Zihao Deng, Benjamin Ghaemmaghami, Ashish Kumar Singh, Benjamin Cho,
Leo Orshansky, Mattan Erez, Michael Orshansky
- Abstract summary: Multi-layer embeddings training (MLET) trains embeddings using factorization of the embedding layer, with an inner dimension higher than the target embedding dimension.
MLET consistently produces better models, especially for rare items.
At constant model quality, MLET allows embedding dimension, and model size, reduction by up to 16x, and 5.8x on average.
- Score: 2.4862527485819186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern DNN-based recommendation systems rely on training-derived embeddings
of sparse features. Input sparsity makes obtaining high-quality embeddings for
rarely-occurring categories harder as their representations are updated
infrequently. We demonstrate a training-time technique to produce superior
embeddings via effective cross-category learning and theoretically explain its
surprising effectiveness. The scheme, termed the multi-layer embeddings
training (MLET), trains embeddings using factorization of the embedding layer,
with an inner dimension higher than the target embedding dimension. For
inference efficiency, MLET converts the trained two-layer embedding into a
single-layer one thus keeping inference-time model size unchanged.
Empirical superiority of MLET is puzzling as its search space is not larger
than that of the single-layer embedding. The strong dependence of MLET on the
inner dimension is even more surprising. We develop a theory that explains both
of these behaviors by showing that MLET creates an adaptive update mechanism
modulated by the singular vectors of embeddings. When tested on multiple
state-of-the-art recommendation models for click-through rate (CTR) prediction
tasks, MLET consistently produces better models, especially for rare items. At
constant model quality, MLET allows embedding dimension, and model size,
reduction by up to 16x, and 5.8x on average, across the models.
Related papers
- 2D Matryoshka Training for Information Retrieval [32.44832240958393]
2D Matryoshka Training is an embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups.
We implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks.
arXiv Detail & Related papers (2024-11-26T10:47:35Z) - Starbucks: Improved Training for 2D Matryoshka Embeddings [32.44832240958393]
We propose Starbucks, a new training strategy for Matryoshka-like embedding models.
For the fine-tuning phase, we provide a fixed list of layer-dimension pairs, from small size to large sizes.
We also introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions.
arXiv Detail & Related papers (2024-10-17T05:33:50Z) - Inheritune: Training Smaller Yet More Attentive Language Models [61.363259848264725]
Inheritune is a simple yet effective training recipe for developing smaller, high-performing language models.
We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb_edu.
arXiv Detail & Related papers (2024-04-12T17:53:34Z) - Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems [17.602059421895856]
FIITED is a system to automatically reduce the memory footprint via FIne-grained In-Training Embedding Dimension pruning.
We show that FIITED can reduce DLRM embedding size by more than 65% while preserving model quality.
On public datasets, FIITED can reduce the size of embedding tables by 2.1x to 800x with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-09T08:04:11Z) - Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - RanPAC: Random Projections and Pre-trained Models for Continual Learning [59.07316955610658]
Continual learning (CL) aims to learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones.
We propose a concise and effective approach for CL with pre-trained models.
arXiv Detail & Related papers (2023-07-05T12:49:02Z) - Phantom Embeddings: Using Embedding Space for Model Regularization in
Deep Neural Networks [12.293294756969477]
The strength of machine learning models stems from their ability to learn complex function approximations from data.
The complex models tend to memorize the training data, which results in poor regularization performance on test data.
We present a novel approach to regularize the models by leveraging the information-rich latent embeddings and their high intra-class correlation.
arXiv Detail & Related papers (2023-04-14T17:15:54Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of
Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - MACE: An Efficient Model-Agnostic Framework for Counterfactual
Explanation [132.77005365032468]
We propose a novel framework of Model-Agnostic Counterfactual Explanation (MACE)
In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity.
Experiments on public datasets validate the effectiveness with better validity, sparsity and proximity.
arXiv Detail & Related papers (2022-05-31T04:57:06Z) - LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z) - Training with Multi-Layer Embeddings for Model Reduction [0.9046327456472286]
We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing d by 4-8X, with a corresponding improvement in memory footprint, at given model accuracy.
arXiv Detail & Related papers (2020-06-10T02:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.