Learning Compressed Embeddings for On-Device Inference
- URL: http://arxiv.org/abs/2203.10135v1
- Date: Fri, 18 Mar 2022 19:32:40 GMT
- Title: Learning Compressed Embeddings for On-Device Inference
- Authors: Niketan Pansare, Jay Katukuri, Aditya Arora, Frank Cipollone, Riyaaz
Shaik, Noyan Tokgozoglu, Chandru Venkataraman
- Abstract summary: In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies.
In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory.
We propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding.
- Score: 2.5641861018746734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning, embeddings are widely used to represent categorical
entities such as words, apps, and movies. An embedding layer maps each entity
to a unique vector, causing the layer's memory requirement to be proportional
to the number of entities. In the recommendation domain, a given category can
have hundreds of thousands of entities, and its embedding layer can take
gigabytes of memory. The scale of these networks makes them difficult to deploy
in resource-constrained environments. In this paper, we propose a novel
approach for reducing the size of an embedding table while still mapping each
entity to its own unique embedding. Rather than maintaining the full embedding
table, we construct each entity's embedding "on the fly" using two separate
embedding tables. The first table employs hashing to force multiple entities to
share an embedding. The second table contains one trainable weight per entity,
allowing the model to distinguish between entities sharing the same embedding.
Since these two tables are trained jointly, the network is able to learn a
unique embedding per entity, helping it maintain a discriminative capability
similar to a model with an uncompressed embedding table. We call this approach
MEmCom (Multi-Embedding Compression). We compare with state-of-the-art model
compression techniques for multiple problem classes including classification
and ranking. On four popular recommender system datasets, MEmCom had a 4%
relative loss in nDCG while compressing the input embedding sizes of our
recommendation models by 16x, 4x, 12x, and 40x. MEmCom outperforms the
state-of-the-art techniques, which achieved 16%, 6%, 10%, and 8% relative loss
in nDCG at the respective compression ratios. Additionally, MEmCom is able to
compress the RankNet ranking model by 32x on a dataset with millions of users'
interactions with games while incurring only a 1% relative loss in nDCG.
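To make the construction concrete, here is a minimal PyTorch sketch of the two-table idea; the class name, the modulo hash, and the parameter names are illustrative choices, not details from the paper:

```python
import torch
import torch.nn as nn

class MEmComEmbedding(nn.Module):
    """Two-table embedding: a hashed shared table plus one trainable
    scalar per entity. Names and the modulo hash are illustrative."""

    def __init__(self, num_entities: int, num_buckets: int, dim: int):
        super().__init__()
        # Table 1: hashing forces multiple entities to share a row,
        # so only num_buckets << num_entities rows are stored.
        self.shared = nn.Embedding(num_buckets, dim)
        # Table 2: one scalar per entity, letting the model distinguish
        # entities that collide in the shared table.
        self.per_entity = nn.Embedding(num_entities, 1)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        buckets = ids % self.shared.num_embeddings
        # The embedding is built "on the fly"; both tables train jointly,
        # so each entity can still end up with a unique vector.
        return self.shared(buckets) * self.per_entity(ids)
```

With num_buckets set to, say, num_entities / 16, the layer stores roughly dim/16 + 1 parameters per entity instead of dim, which is the kind of compression ratio reported above.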
Related papers
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings.
Our approach consists of two stages, the first of which applies popularity-weighted regularization to balance the code distribution between high- and low-frequency features.
Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
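As a rough illustration of the quantization half of this idea (a generic per-row int8 scheme, not MEC's actual two-stage codebook procedure):

```python
import torch

def quantize_rows(table: torch.Tensor):
    """Per-row symmetric int8 quantization of a pre-trained embedding
    table; a generic sketch, not MEC's method."""
    scale = table.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    codes = torch.round(table / scale).to(torch.int8)  # 4x smaller than fp32
    return codes, scale

def lookup(codes: torch.Tensor, scale: torch.Tensor, ids: torch.Tensor):
    # Dequantize only the requested rows at inference time.
    return codes[ids].float() * scale[ids]
```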
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
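A minimal sketch of cross-layer KV reuse; the hand-picked layer mapping and the bare matrix projections are assumptions, since KVSharer selects which layers share by a dissimilarity criterion:

```python
import torch

# "Consumer" layers skip their own K/V projections and read another
# layer's cached tensors. The mapping below is hand-picked for
# illustration, not KVSharer's actual selection strategy.
share_map = {5: 2, 6: 2, 9: 8}  # consumer layer -> provider layer

def get_kv(layer: int, hidden: torch.Tensor,
           wk: torch.Tensor, wv: torch.Tensor,
           cache: dict[int, tuple[torch.Tensor, torch.Tensor]]):
    src = share_map.get(layer, layer)
    if src not in cache:
        # Only provider layers pay for the projection and the cache entry;
        # consumers reuse it, shrinking the stored KV cache.
        cache[src] = (hidden @ wk, hidden @ wv)
    return cache[src]
```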
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
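The tiling idea can be sketched in one dimension: accumulate the log-sum-exp of similarity logits block by block so the full NxN logit matrix is never materialized (function and parameter names are illustrative; the paper additionally tiles across distributed devices):

```python
import torch

def tiled_infonce(q: torch.Tensor, k: torch.Tensor, tile: int = 1024):
    """InfoNCE over row-aligned positive pairs, computed in column tiles.
    Peak memory is O(n * tile) instead of O(n * n)."""
    n = q.shape[0]
    pos = (q * k).sum(dim=1)                      # positive-pair logits
    lse = torch.full((n,), float("-inf"), device=q.device)
    for start in range(0, n, tile):
        block = q @ k[start:start + tile].T       # (n, <=tile) logits
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=1))
    return (lse - pos).mean()                     # -log softmax at positives
```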
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- Head-wise Shareable Attention for Large Language Models [56.92068213969036]
Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices.
Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with less performance drop.
We present a perspective on head-wise shareable attention for large language models.
arXiv Detail & Related papers (2024-02-19T04:19:36Z)
- Mem-Rec: Memory Efficient Recommendation System using Alternative Representation [6.542635536704625]
MEM-REC is a novel alternative representation approach for embedding tables.
We show that MEM-REC not only maintains recommendation quality but also improves embedding latency.
arXiv Detail & Related papers (2023-05-12T02:36:07Z)
- Learning to Collide: Recommendation System Model Compression with Learned Hash Functions [4.6994057182972595]
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables.
A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space.
This hashing reduces the number of unique representations that must be stored in the embedding table, thus decreasing its size.
We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids.
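A minimal sketch of a learned, rather than fixed, id-to-bucket mapping; the Gumbel-softmax relaxation and all names are assumptions for illustration, not the paper's formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedHash(nn.Module):
    """Map ids into a small bucket space with a trainable assignment,
    so the task loss can push semantically similar ids to collide."""

    def __init__(self, num_ids: int, num_buckets: int, dim: int):
        super().__init__()
        # Trainable assignment logits: one row of bucket scores per id.
        self.logits = nn.Parameter(torch.zeros(num_ids, num_buckets))
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Hard bucket choice with a differentiable (straight-through)
        # relaxation, so the mapping itself is learned end to end.
        assign = F.gumbel_softmax(self.logits[ids], tau=1.0, hard=True)
        return assign @ self.table.weight
```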
arXiv Detail & Related papers (2022-03-28T06:07:30Z)
- Modeling Heterogeneous Hierarchies with Relation-specific Hyperbolic Cones [64.75766944882389]
We present ConE (Cone Embedding), a KG embedding model that is able to simultaneously model multiple hierarchical as well as non-hierarchical relations in a knowledge graph.
In particular, ConE uses cone containment constraints in different subspaces of the hyperbolic embedding space to capture multiple heterogeneous hierarchies.
Our approach yields a new state-of-the-art Hits@1 of 45.3% on WN18RR and 16.1% on DDB14 (0.231 MRR).
arXiv Detail & Related papers (2021-10-28T07:16:08Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We reduce space occupancy to as little as 0.6% of the original size on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
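A generic sketch of combining magnitude pruning with int8 quantization (the paper's actual format additionally applies source coding to the result):

```python
import torch

def compress_layer(w: torch.Tensor, sparsity: float = 0.9):
    """Prune the smallest weights, then store survivors as int8 with a
    per-tensor scale; a generic sketch, not the paper's storage format."""
    k = max(1, int(w.numel() * sparsity))
    thresh = w.abs().flatten().kthvalue(k).values   # magnitude cutoff
    mask = w.abs() > thresh                         # keep ~(1 - sparsity)
    idx = mask.nonzero()                            # coordinates of kept weights
    vals = w[mask]
    scale = vals.abs().max() / 127.0
    codes = torch.round(vals / scale).to(torch.int8)
    return idx, codes, scale                        # compact representation
```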
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer [15.403616481651383]
We propose an Adaptively-Masked Twins-based Layer (AMTL) behind the standard embedding layer.
AMTL generates a mask vector to mask the undesired dimensions for each embedding vector.
The mask vector brings flexibility in selecting the dimensions and the proposed layer can be easily added to either untrained or trained DLRMs.
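A minimal sketch of the masking idea; feeding the embedding itself to the mask network and the sigmoid relaxation are assumptions, and the paper's twins-based design is more involved:

```python
import torch
import torch.nn as nn

class AdaptiveMaskLayer(nn.Module):
    """A small side network emits a per-embedding mask that suppresses
    unneeded dimensions, effectively shortening each embedding."""

    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_net(emb))  # soft mask in (0, 1)
        # Dimensions driven toward zero are pruned per embedding vector;
        # the layer can sit behind any standard embedding layer.
        return emb * mask
```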
arXiv Detail & Related papers (2021-08-24T11:50:49Z)
- Mixed-Precision Embedding Using a Cache [3.0298877977523144]
We propose a novel change to embedding tables using a cache memory architecture, where the majority of rows in an embedding table are trained in low precision.
For an open-source deep learning recommendation model (DLRM) running with the Criteo Kaggle dataset, we achieve a 3x memory reduction with INT8-precision embedding tables and a full-precision cache.
For an industrial-scale model and dataset, we achieve an even higher memory reduction of more than 7x with INT4 precision and a cache sized at 1% of the embedding tables.
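A simplified sketch of the cache idea, with hot rows fixed up front rather than managed dynamically as the paper's cache architecture would:

```python
import torch
import torch.nn as nn

class CachedMixedPrecEmbedding(nn.Module):
    """Most rows live in int8; a small set of hot rows stays in full
    precision. Choosing hot rows ahead of time is an illustrative
    simplification."""

    def __init__(self, weight: torch.Tensor, hot_ids: torch.Tensor):
        super().__init__()
        scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("scale", scale)
        self.register_buffer("q", torch.round(weight / scale).to(torch.int8))
        self.register_buffer("hot_ids", hot_ids)
        self.cache = nn.Parameter(weight[hot_ids].clone())  # full precision

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        out = self.q[ids].float() * self.scale[ids]  # dequantized int8 rows
        # Rows that hit the cache are served in full precision instead.
        for slot, hid in enumerate(self.hot_ids.tolist()):
            out[ids == hid] = self.cache[slot]
        return out
```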
arXiv Detail & Related papers (2020-10-21T20:49:54Z)
- Learning to Embed Categorical Features without Embedding Tables for Recommendation [22.561967284428707]
We propose an alternative embedding framework, replacing embedding tables with a deep embedding network that computes embeddings on the fly.
The encoding module is deterministic, non-learnable, and free of storage, while the embedding network is updated during training to learn embedding generation.
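A minimal sketch of a table-free embedding layer in this spirit: a deterministic, non-learnable hash encoding followed by a small learned network (sizes and the hash family are illustrative):

```python
import torch
import torch.nn as nn

class DeepHashEmbedding(nn.Module):
    """Deterministic hash encoding of the id feeds a learned MLP;
    no per-id parameters are stored anywhere."""

    def __init__(self, k: int = 64, dim: int = 32):
        super().__init__()
        self.p = 2**31 - 1  # a Mersenne prime for the hash family
        # Fixed random hash coefficients; buffers are saved with the
        # model but never trained.
        self.register_buffer("a", torch.randint(1, self.p, (k,)))
        self.register_buffer("b", torch.randint(0, self.p, (k,)))
        self.net = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = (ids.unsqueeze(-1) * self.a + self.b) % self.p  # k hashes per id
        x = h.float() / self.p * 2.0 - 1.0                  # scale to [-1, 1]
        return self.net(x)  # embedding computed on the fly
```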
arXiv Detail & Related papers (2020-10-21T06:37:28Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
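Generation must be constrained so only valid entity names come out; a compact way to express such a constraint (a sketch with toy token ids, details assumed rather than taken from the summary) is a prefix trie over tokenized names:

```python
# At each decoding step, only tokens that extend some valid entity name
# are allowed; decoder logits outside this set would be masked out.
from typing import Dict

Trie = Dict[int, dict]

def build_trie(names: list[list[int]]) -> Trie:
    root: Trie = {}
    for seq in names:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie: Trie, prefix: list[int]) -> list[int]:
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return list(node.keys())

# e.g. names = [[5, 9], [5, 7, 2]]; allowed_next(build_trie(names), [5]) -> [9, 7]
```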
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.