Learning Compressed Embeddings for On-Device Inference
- URL: http://arxiv.org/abs/2203.10135v1
- Date: Fri, 18 Mar 2022 19:32:40 GMT
- Title: Learning Compressed Embeddings for On-Device Inference
- Authors: Niketan Pansare, Jay Katukuri, Aditya Arora, Frank Cipollone, Riyaaz
Shaik, Noyan Tokgozoglu, Chandru Venkataraman
- Abstract summary: In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies.
In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory.
We propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning, embeddings are widely used to represent categorical
entities such as words, apps, and movies. An embedding layer maps each entity
to a unique vector, causing the layer's memory requirement to be proportional
to the number of entities. In the recommendation domain, a given category can
have hundreds of thousands of entities, and its embedding layer can take
gigabytes of memory. The scale of these networks makes them difficult to deploy
in resource-constrained environments. In this paper, we propose a novel
approach for reducing the size of an embedding table while still mapping each
entity to its own unique embedding. Rather than maintaining the full embedding
table, we construct each entity's embedding "on the fly" using two separate
embedding tables. The first table employs hashing to force multiple entities to
share an embedding. The second table contains one trainable weight per entity,
allowing the model to distinguish between entities sharing the same embedding.
Since these two tables are trained jointly, the network is able to learn a
unique embedding per entity, helping it maintain a discriminative capability
similar to a model with an uncompressed embedding table. We call this approach
MEmCom (Multi-Embedding Compression). We compare with state-of-the-art model
compression techniques for multiple problem classes including classification
and ranking. On four popular recommender system datasets, MEmCom had a 4%
relative loss in nDCG while compressing the input embedding sizes of our
recommendation models by 16x, 4x, 12x, and 40x. MEmCom outperforms the
state-of-the-art techniques, which achieved 16%, 6%, 10%, and 8% relative loss
in nDCG at the respective compression ratios. Additionally, MEmCom is able to
compress the RankNet ranking model by 32x on a dataset with millions of users'
interactions with games while incurring only a 1% relative loss in nDCG.
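The two-table construction described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the modulo bucket assignment and the elementwise combination (per-entity scalar times shared vector) are assumptions, and all names are hypothetical.

```python
import numpy as np

class MEmComEmbedding:
    """Sketch of a MEmCom-style compressed embedding.

    Assumption: the per-entity weight rescales the shared (hashed)
    vector elementwise; the paper only states the tables are combined
    and trained jointly."""

    def __init__(self, num_entities, num_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        # First table: hashed embeddings, num_buckets << num_entities,
        # so multiple entities share a row.
        self.shared = rng.normal(size=(num_buckets, dim)).astype(np.float32)
        # Second table: one trainable weight per entity, letting the model
        # distinguish entities that share a bucket.
        self.weights = np.ones(num_entities, dtype=np.float32)
        self.num_buckets = num_buckets

    def lookup(self, ids):
        ids = np.asarray(ids)
        buckets = ids % self.num_buckets        # hashing forces sharing
        return self.weights[ids, None] * self.shared[buckets]

emb = MEmComEmbedding(num_entities=100_000, num_buckets=1_000, dim=8)
v1, v2 = emb.lookup([5, 1_005])                 # ids 5 and 1005 collide
```

With these (illustrative) sizes the two tables hold 1,000 x 8 + 100,000 floats instead of 100,000 x 8, roughly a 7x reduction; once the per-entity weights are trained, colliding ids map to distinct vectors.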
Related papers
- Head-wise Shareable Attention for Large Language Models [63.973142426228016]
Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices.
Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with less performance drop.
We present a perspective on head-wise shareable attention for large language models.
arXiv Detail & Related papers (2024-02-19T04:19:36Z)
- Mem-Rec: Memory Efficient Recommendation System using Alternative Representation [6.542635536704625]
MEM-REC is a novel alternative representation approach for embedding tables.
We show that MEM-REC can not only maintain the recommendation quality but can also improve the embedding latency.
arXiv Detail & Related papers (2023-05-12T02:36:07Z)
- Unicom: Universal and Compact Representation Learning for Image Retrieval [65.96296089560421]
We cluster the large-scale LAION400M into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model.
To alleviate such conflict, we randomly select partial inter-class prototypes to construct the margin-based softmax loss.
Our method significantly outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks.
arXiv Detail & Related papers (2023-04-12T14:25:52Z)
- Learning to Collide: Recommendation System Model Compression with Learned Hash Functions [4.6994057182972595]
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables.
A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space.
This hashing reduces the number of unique representations that must be stored in the embedding table; thus decreasing its size.
We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids.
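The baseline hashing trick this paper improves on is easy to sketch. A minimal NumPy illustration, with all sizes and names hypothetical; the learned variant would replace the modulo with a trained mapping that collides semantically similar ids:

```python
import numpy as np

NUM_IDS, NUM_BUCKETS, DIM = 50_000, 512, 16
rng = np.random.default_rng(0)
# One shared table with far fewer rows than there are ids.
table = rng.normal(size=(NUM_BUCKETS, DIM)).astype(np.float32)

def hashed_embed(ids, mapping=None):
    ids = np.asarray(ids)
    # Plain hashing trick: modulo produces arbitrary collisions.
    # A Learned Hash Function would supply `mapping`, a trained
    # id -> bucket table that groups similar ids together.
    buckets = ids % NUM_BUCKETS if mapping is None else mapping[ids]
    return table[buckets]
```

Storage drops from NUM_IDS x DIM rows to NUM_BUCKETS x DIM (plus the mapping, if learned), at the cost of ids sharing representations.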
arXiv Detail & Related papers (2022-03-28T06:07:30Z)
- Modeling Heterogeneous Hierarchies with Relation-specific Hyperbolic Cones [64.75766944882389]
We present ConE (Cone Embedding), a KG embedding model that is able to simultaneously model multiple hierarchical as well as non-hierarchical relations in a knowledge graph.
In particular, ConE uses cone containment constraints in different subspaces of the hyperbolic embedding space to capture multiple heterogeneous hierarchies.
Our approach yields new state-of-the-art Hits@1 of 45.3% on WN18RR and 16.1% on DDB14 (0.231 MRR).
arXiv Detail & Related papers (2021-10-28T07:16:08Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitive as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer [15.403616481651383]
We propose an Adaptively-Masked Twins-based Layer (AMTL) behind the standard embedding layer.
AMTL generates a mask vector to mask the undesired dimensions for each embedding vector.
The mask vector brings flexibility in selecting the dimensions and the proposed layer can be easily added to either untrained or trained DLRMs.
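The masking idea can be sketched as follows. This is a simplification: in AMTL the per-vector mask comes from a small learned twins network, while here a fixed per-row dimension budget stands in for it, and all names are hypothetical.

```python
import numpy as np

def masked_embedding(emb, keep_dims):
    # AMTL-style masking: zero out the unwanted trailing dimensions of
    # each embedding vector, so different rows effectively use
    # different embedding sizes. The learned mask generator is replaced
    # by the fixed `keep_dims` budgets for illustration.
    emb = np.asarray(emb, dtype=np.float32)
    mask = np.zeros_like(emb)
    for row, k in enumerate(keep_dims):
        mask[row, :k] = 1.0
    return emb * mask
```

Because the mask is applied behind a standard embedding layer, the same mechanism can wrap an already-trained table without retraining it from scratch.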
arXiv Detail & Related papers (2021-08-24T11:50:49Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models, yet achieving 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
- Mixed-Precision Embedding Using a Cache [3.0298877977523144]
We propose a novel change to embedding tables using a cache memory architecture, in which the majority of rows in an embedding table are trained in low precision.
For an open source deep learning recommendation model (DLRM) running with CriteoKaggle dataset, we achieve 3x memory reduction with INT8 precision embedding tables and full-precision cache.
For an industrial scale model and dataset, we achieve even higher >7x memory reduction with INT4 precision and cache size 1% of embedding tables.
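The cache-plus-low-precision layout can be illustrated with simple symmetric INT8 quantization. This is a sketch of the general idea, not the paper's method: the per-row scaling scheme, the static hot-row cache, and all names are assumptions.

```python
import numpy as np

class CachedInt8Embedding:
    """Illustrative mixed-precision embedding: most rows stored as INT8
    with a per-row scale, a small set of hot rows kept in full precision
    (the 'cache')."""

    def __init__(self, table_fp32, hot_ids):
        # Symmetric per-row quantization to INT8.
        self.scale = np.abs(table_fp32).max(axis=1, keepdims=True) / 127.0
        self.scale[self.scale == 0] = 1.0
        self.q = np.round(table_fp32 / self.scale).astype(np.int8)
        # Full-precision copies of the frequently accessed rows only.
        self.cache = {int(i): table_fp32[i].copy() for i in hot_ids}

    def lookup(self, idx):
        row = self.cache.get(int(idx))
        if row is not None:
            return row                                   # exact, fp32 hit
        return self.q[idx].astype(np.float32) * self.scale[idx]
```

Storage is roughly 1 byte per weight plus one scale per row, versus 4 bytes per weight in fp32, with the cache sized to a small fraction of the table.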
arXiv Detail & Related papers (2020-10-21T20:49:54Z)
- Learning to Embed Categorical Features without Embedding Tables for Recommendation [22.561967284428707]
We propose an alternative embedding framework, replacing embedding tables by a deep embedding network to compute embeddings on the fly.
The encoding module is deterministic, non-learnable, and free of storage, while the embedding network is updated during the training time to learn embedding generation.
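A table-free pipeline of this shape can be sketched in two stages: a deterministic hash-based encoding, then a small network that maps it to an embedding. The hash constants, layer sizes, and all names below are illustrative assumptions, not the paper's design.

```python
import numpy as np

def dense_hash_encoding(entity_id, k=32, m=1_000_003):
    # Deterministic, non-learnable, storage-free encoding: k simple
    # multiplicative hashes of the id, normalized into [-1, 1).
    # (Constants are illustrative, not from the paper.)
    a = np.arange(1, k + 1, dtype=np.int64) * 2_654_435_761
    h = (a * (entity_id + 1)) % m
    return (h / m).astype(np.float32) * 2.0 - 1.0

def embed(entity_id, w1, b1, w2, b2):
    # A small trainable MLP turns the fixed code into a learned
    # embedding, computed on the fly instead of looked up in a table.
    x = dense_hash_encoding(entity_id, k=w1.shape[0])
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    return h @ w2 + b2
```

Only the network weights are stored, so memory no longer grows with the number of entities; the encoding stays fixed while the network is trained.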
arXiv Detail & Related papers (2020-10-21T06:37:28Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.