Learning Compressed Embeddings for On-Device Inference
- URL: http://arxiv.org/abs/2203.10135v1
- Date: Fri, 18 Mar 2022 19:32:40 GMT
- Title: Learning Compressed Embeddings for On-Device Inference
- Authors: Niketan Pansare, Jay Katukuri, Aditya Arora, Frank Cipollone, Riyaaz
Shaik, Noyan Tokgozoglu, Chandru Venkataraman
- Abstract summary: In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies.
In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory.
We propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding.
- Score: 2.5641861018746734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning, embeddings are widely used to represent categorical
entities such as words, apps, and movies. An embedding layer maps each entity
to a unique vector, causing the layer's memory requirement to be proportional
to the number of entities. In the recommendation domain, a given category can
have hundreds of thousands of entities, and its embedding layer can take
gigabytes of memory. The scale of these networks makes them difficult to deploy
in resource-constrained environments. In this paper, we propose a novel
approach for reducing the size of an embedding table while still mapping each
entity to its own unique embedding. Rather than maintaining the full embedding
table, we construct each entity's embedding "on the fly" using two separate
embedding tables. The first table employs hashing to force multiple entities to
share an embedding. The second table contains one trainable weight per entity,
allowing the model to distinguish between entities sharing the same embedding.
Since these two tables are trained jointly, the network is able to learn a
unique embedding per entity, helping it maintain a discriminative capability
similar to a model with an uncompressed embedding table. We call this approach
MEmCom (Multi-Embedding Compression). We compare with state-of-the-art model
compression techniques for multiple problem classes including classification
and ranking. On four popular recommender system datasets, MEmCom had a 4%
relative loss in nDCG while compressing the input embedding sizes of our
recommendation models by 16x, 4x, 12x, and 40x. MEmCom outperforms the
state-of-the-art techniques, which achieved 16%, 6%, 10%, and 8% relative loss
in nDCG at the respective compression ratios. Additionally, MEmCom is able to
compress the RankNet ranking model by 32x on a dataset with millions of users'
interactions with games while incurring only a 1% relative loss in nDCG.
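To make the construction concrete, here is a minimal PyTorch sketch of the two-table idea; the class name, the modulo hash, and the parameter names are illustrative choices, not details from the paper:

```python
import torch
import torch.nn as nn

class MEmComEmbedding(nn.Module):
    """Two-table embedding: a hashed shared table plus one trainable
    scalar per entity. Names and the modulo hash are illustrative."""

    def __init__(self, num_entities: int, num_buckets: int, dim: int):
        super().__init__()
        # Table 1: hashing forces multiple entities to share a row,
        # so only num_buckets << num_entities rows are stored.
        self.shared = nn.Embedding(num_buckets, dim)
        # Table 2: one scalar per entity, letting the model distinguish
        # entities that collide in the shared table.
        self.per_entity = nn.Embedding(num_entities, 1)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        buckets = ids % self.shared.num_embeddings
        # The embedding is built "on the fly"; both tables train jointly,
        # so each entity can still end up with a unique vector.
        return self.shared(buckets) * self.per_entity(ids)
```

With num_buckets set to, say, num_entities / 16, the layer stores roughly dim/16 + 1 parameters per entity instead of dim, which is the kind of compression ratio reported above.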
Related papers
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings.
Our approach consists of two stages, the first of which applies popularity-weighted regularization to balance the code distribution between high- and low-frequency features.
Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
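As a rough illustration of the quantization half of this idea (a generic per-row int8 scheme, not MEC's actual two-stage codebook procedure):

```python
import torch

def quantize_rows(table: torch.Tensor):
    """Per-row symmetric int8 quantization of a pre-trained embedding
    table; a generic sketch, not MEC's method."""
    scale = table.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    codes = torch.round(table / scale).to(torch.int8)  # 4x smaller than fp32
    return codes, scale

def lookup(codes: torch.Tensor, scale: torch.Tensor, ids: torch.Tensor):
    # Dequantize only the requested rows at inference time.
    return codes[ids].float() * scale[ids]
```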
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
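A minimal sketch of cross-layer KV reuse; the hand-picked layer mapping and the bare matrix projections are assumptions, since KVSharer selects which layers share by a dissimilarity criterion:

```python
import torch

# "Consumer" layers skip their own K/V projections and read another
# layer's cached tensors. The mapping below is hand-picked for
# illustration, not KVSharer's actual selection strategy.
share_map = {5: 2, 6: 2, 9: 8}  # consumer layer -> provider layer

def get_kv(layer: int, hidden: torch.Tensor,
           wk: torch.Tensor, wv: torch.Tensor,
           cache: dict[int, tuple[torch.Tensor, torch.Tensor]]):
    src = share_map.get(layer, layer)
    if src not in cache:
        # Only provider layers pay for the projection and the cache entry;
        # consumers reuse it, shrinking the stored KV cache.
        cache[src] = (hidden @ wk, hidden @ wv)
    return cache[src]
```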
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
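The tiling idea can be sketched in one dimension: accumulate the log-sum-exp of similarity logits block by block so the full NxN logit matrix is never materialized (function and parameter names are illustrative; the paper additionally tiles across distributed devices):

```python
import torch

def tiled_infonce(q: torch.Tensor, k: torch.Tensor, tile: int = 1024):
    """InfoNCE over row-aligned positive pairs, computed in column tiles.
    Peak memory is O(n * tile) instead of O(n * n)."""
    n = q.shape[0]
    pos = (q * k).sum(dim=1)                      # positive-pair logits
    lse = torch.full((n,), float("-inf"), device=q.device)
    for start in range(0, n, tile):
        block = q @ k[start:start + tile].T       # (n, <=tile) logits
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=1))
    return (lse - pos).mean()                     # -log softmax at positives
```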
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
- Head-wise Shareable Attention for Large Language Models [56.92068213969036]
Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices.
Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with less performance drop.
We present a perspective on head-wise shareable attention for large language models.
arXiv Detail & Related papers (2024-02-19T04:19:36Z)
- Mem-Rec: Memory Efficient Recommendation System using Alternative Representation [6.542635536704625]
MEM-REC is a novel alternative representation approach for embedding tables.
We show that MEM-REC not only maintains recommendation quality but also improves embedding latency.
arXiv Detail & Related papers (2023-05-12T02:36:07Z)
- Learning to Collide: Recommendation System Model Compression with Learned Hash Functions [4.6994057182972595]
A key characteristic of deep recommendation models is the immense memory requirements of their embedding tables.
A common technique to reduce model size is to hash all of the categorical variable identifiers (ids) into a smaller space.
This hashing reduces the number of unique representations that must be stored in the embedding table, thus decreasing its size.
We introduce an alternative approach, Learned Hash Functions, which instead learns a new mapping function that encourages collisions between semantically similar ids.
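A minimal sketch of a learned, rather than fixed, id-to-bucket mapping; the Gumbel-softmax relaxation and all names are assumptions for illustration, not the paper's formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedHash(nn.Module):
    """Map ids into a small bucket space with a trainable assignment,
    so the task loss can push semantically similar ids to collide."""

    def __init__(self, num_ids: int, num_buckets: int, dim: int):
        super().__init__()
        # Trainable assignment logits: one row of bucket scores per id.
        self.logits = nn.Parameter(torch.zeros(num_ids, num_buckets))
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # Hard bucket choice with a differentiable (straight-through)
        # relaxation, so the mapping itself is learned end to end.
        assign = F.gumbel_softmax(self.logits[ids], tau=1.0, hard=True)
        return assign @ self.table.weight
```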
arXiv Detail & Related papers (2022-03-28T06:07:30Z)
- Modeling Heterogeneous Hierarchies with Relation-specific Hyperbolic Cones [64.75766944882389]
We present ConE (Cone Embedding), a KG embedding model that is able to simultaneously model multiple hierarchical as well as non-hierarchical relations in a knowledge graph.
In particular, ConE uses cone containment constraints in different subspaces of the hyperbolic embedding space to capture multiple heterogeneous hierarchies.
Our approach yields a new state-of-the-art Hits@1 of 45.3% on WN18RR and 16.1% on DDB14 (0.231 MRR).
arXiv Detail & Related papers (2021-10-28T07:16:08Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We reduce space occupancy to as little as 0.6% of the original size on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
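A generic sketch of combining magnitude pruning with int8 quantization (the paper's actual format additionally applies source coding to the result):

```python
import torch

def compress_layer(w: torch.Tensor, sparsity: float = 0.9):
    """Prune the smallest weights, then store survivors as int8 with a
    per-tensor scale; a generic sketch, not the paper's storage format."""
    k = max(1, int(w.numel() * sparsity))
    thresh = w.abs().flatten().kthvalue(k).values   # magnitude cutoff
    mask = w.abs() > thresh                         # keep ~(1 - sparsity)
    idx = mask.nonzero()                            # coordinates of kept weights
    vals = w[mask]
    scale = vals.abs().max() / 127.0
    codes = torch.round(vals / scale).to(torch.int8)
    return idx, codes, scale                        # compact representation
```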
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer [15.403616481651383]
We propose an Adaptively-Masked Twins-based Layer (AMTL) behind the standard embedding layer.
AMTL generates a mask vector to mask the undesired dimensions for each embedding vector.
The mask vector brings flexibility in selecting the dimensions and the proposed layer can be easily added to either untrained or trained DLRMs.
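A minimal sketch of the masking idea; feeding the embedding itself to the mask network and the sigmoid relaxation are assumptions, and the paper's twins-based design is more involved:

```python
import torch
import torch.nn as nn

class AdaptiveMaskLayer(nn.Module):
    """A small side network emits a per-embedding mask that suppresses
    unneeded dimensions, effectively shortening each embedding."""

    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_net(emb))  # soft mask in (0, 1)
        # Dimensions driven toward zero are pruned per embedding vector;
        # the layer can sit behind any standard embedding layer.
        return emb * mask
```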
arXiv Detail & Related papers (2021-08-24T11:50:49Z)
- Mixed-Precision Embedding Using a Cache [3.0298877977523144]
We propose a novel change to embedding tables using a cache memory architecture, where the majority of rows in an embedding table are trained in low precision.
For an open-source deep learning recommendation model (DLRM) running with the Criteo Kaggle dataset, we achieve a 3x memory reduction with INT8-precision embedding tables and a full-precision cache.
For an industrial-scale model and dataset, we achieve an even higher memory reduction of more than 7x with INT4 precision and a cache sized at 1% of the embedding tables.
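A simplified sketch of the cache idea, with hot rows fixed up front rather than managed dynamically as the paper's cache architecture would:

```python
import torch
import torch.nn as nn

class CachedMixedPrecEmbedding(nn.Module):
    """Most rows live in int8; a small set of hot rows stays in full
    precision. Choosing hot rows ahead of time is an illustrative
    simplification."""

    def __init__(self, weight: torch.Tensor, hot_ids: torch.Tensor):
        super().__init__()
        scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("scale", scale)
        self.register_buffer("q", torch.round(weight / scale).to(torch.int8))
        self.register_buffer("hot_ids", hot_ids)
        self.cache = nn.Parameter(weight[hot_ids].clone())  # full precision

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        out = self.q[ids].float() * self.scale[ids]  # dequantized int8 rows
        # Rows that hit the cache are served in full precision instead.
        for slot, hid in enumerate(self.hot_ids.tolist()):
            out[ids == hid] = self.cache[slot]
        return out
```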
arXiv Detail & Related papers (2020-10-21T20:49:54Z)
- Learning to Embed Categorical Features without Embedding Tables for Recommendation [22.561967284428707]
We propose an alternative embedding framework, replacing embedding tables with a deep embedding network that computes embeddings on the fly.
The encoding module is deterministic, non-learnable, and free of storage, while the embedding network is updated during training to learn embedding generation.
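A minimal sketch of a table-free embedding layer in this spirit: a deterministic, non-learnable hash encoding followed by a small learned network (sizes and the hash family are illustrative):

```python
import torch
import torch.nn as nn

class DeepHashEmbedding(nn.Module):
    """Deterministic hash encoding of the id feeds a learned MLP;
    no per-id parameters are stored anywhere."""

    def __init__(self, k: int = 64, dim: int = 32):
        super().__init__()
        self.p = 2**31 - 1  # a Mersenne prime for the hash family
        # Fixed random hash coefficients; buffers are saved with the
        # model but never trained.
        self.register_buffer("a", torch.randint(1, self.p, (k,)))
        self.register_buffer("b", torch.randint(0, self.p, (k,)))
        self.net = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = (ids.unsqueeze(-1) * self.a + self.b) % self.p  # k hashes per id
        x = h.float() / self.p * 2.0 - 1.0                  # scale to [-1, 1]
        return self.net(x)  # embedding computed on the fly
```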
arXiv Detail & Related papers (2020-10-21T06:37:28Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
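Generation must be constrained so only valid entity names come out; a compact way to express such a constraint (a sketch with toy token ids, details assumed rather than taken from the summary) is a prefix trie over tokenized names:

```python
# At each decoding step, only tokens that extend some valid entity name
# are allowed; decoder logits outside this set would be masked out.
from typing import Dict

Trie = Dict[int, dict]

def build_trie(names: list[list[int]]) -> Trie:
    root: Trie = {}
    for seq in names:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie: Trie, prefix: list[int]) -> list[int]:
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return list(node.keys())

# e.g. names = [[5, 9], [5, 7, 2]]; allowed_next(build_trie(names), [5]) -> [9, 7]
```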
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.