Rediscovering Hashed Random Projections for Efficient Quantization of
Contextualized Sentence Embeddings
- URL: http://arxiv.org/abs/2304.02481v2
- Date: Tue, 16 May 2023 05:39:46 GMT
- Title: Rediscovering Hashed Random Projections for Efficient Quantization of
Contextualized Sentence Embeddings
- Authors: Ulf A. Hamster, Ji-Ung Lee, Alexander Geyken, Iryna Gurevych
- Abstract summary: Training and inference on edge devices often requires an efficient setup due to computational limitations.
Pre-computing data representations and caching them on a server can mitigate extensive edge device computation.
We propose a simple, yet effective approach that uses randomly initialized hyperplane projections.
We show that the embeddings remain effective for training models across various English and German sentence classification tasks, retaining 94%--99% of their floating-point performance.
- Score: 113.38884267189871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training and inference on edge devices often requires an efficient setup due
to computational limitations. While pre-computing data representations and
caching them on a server can mitigate extensive edge device computation, this
leads to two challenges. First, the amount of storage required on the server
scales linearly with the number of instances. Second, sending large amounts of
data to an edge device requires substantial bandwidth. To reduce
the memory footprint of pre-computed data representations, we propose a simple,
yet effective approach that uses randomly initialized hyperplane projections.
To further reduce their size by up to 98.96%, we quantize the resulting
floating-point representations into binary vectors. Despite the greatly reduced
size, we show that the embeddings remain effective for training models across
various English and German sentence classification tasks, retaining 94%--99%
of their floating-point performance.
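As a rough illustration of the approach (a sketch of hashed random projections with sign binarization, not the authors' released code; the 768-dimensional inputs and 1024 hyperplanes are assumptions for the example), a fixed random projection matrix lets the server and the edge device reproduce the same binary codes:

```python
import numpy as np

def make_hyperplanes(dim_in, n_bits, seed=0):
    """Randomly initialized hyperplanes, fixed by the seed so that the
    server and the edge device share the same projection."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(dim_in, n_bits)).astype(np.float32)

def binarize(embeddings, hyperplanes):
    """Project float embeddings onto the hyperplanes and keep only the
    sign of each projection, packed into uint8 (1 bit per dimension)."""
    bits = (embeddings @ hyperplanes) >= 0.0
    return np.packbits(bits, axis=1)

# Example with assumed sizes: 768-d float32 embeddings -> 1024-bit codes.
X = np.random.randn(1000, 768).astype(np.float32)
W = make_hyperplanes(768, 1024)
codes = binarize(X, W)                      # shape (1000, 128), dtype uint8
print(X.nbytes, "->", codes.nbytes)         # 3072000 -> 128000 bytes (~96% smaller)
```

A downstream classifier can then be trained directly on the unpacked bits; the 98.96% reduction quoted in the abstract refers to the authors' own embedding and code sizes, not to the dimensions assumed here.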
Related papers
- Low-Precision Floating-Point for Efficient On-Board Deep Neural Network
Processing [0.9374652839580183]
We study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology.
Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision.
An initial hardware study also confirms the potential impact of such low-precision floating-point designs.
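As a hedged illustration of what 6-bit floating-point quantization does numerically, the sketch below rounds values onto a simple sign/exponent/mantissa grid; the paper's exact format (exponent bias, subnormal and special-value handling) may differ.

```python
import numpy as np

def fake_quant_minifloat(x, exp_bits=3, man_bits=2):
    """Round-to-nearest simulation of a signed mini-float value
    (1 sign bit + exp_bits + man_bits); no NaN/Inf handling."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    e_min, e_max = 1 - bias, bias                      # normal exponent range
    max_val = 2.0 ** e_max * (2.0 - 2.0 ** -man_bits)  # largest finite value
    sign, mag = np.sign(x), np.abs(x)
    e = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), e_min, e_max)
    step = 2.0 ** (e - man_bits)                       # grid spacing at that exponent
    return sign * np.minimum(np.round(mag / step) * step, max_val)

print(fake_quant_minifloat([0.1, 1.37, 100.0]))        # quantized to [0.125, 1.25, 14.0]
```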
arXiv Detail & Related papers (2023-11-18T21:36:52Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
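The building block behind such estimators is classical column-row sampling (CRS) for approximating a matrix product; the sketch below shows only the plain unbiased CRS estimator, whereas WTA-CRS additionally keeps the highest-probability column-row pairs deterministically to reduce variance.

```python
import numpy as np

def crs_matmul(A, B, k, seed=0):
    """Unbiased column-row sampling estimate of A @ B using k sampled
    outer products (classical CRS; not the winner-take-all variant)."""
    rng = np.random.default_rng(seed)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()                              # sampling probabilities
    idx = rng.choice(A.shape[1], size=k, p=p)
    # reweight each sampled outer product by 1/(k * p_i) for unbiasedness
    return (A[:, idx] / (k * p[idx])) @ B[idx, :]

A, B = np.random.randn(64, 512), np.random.randn(512, 64)
approx = crs_matmul(A, B, k=128)                 # uses 128 of 512 column-row pairs
```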
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
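For context, the overflow condition itself is simple worst-case arithmetic; the sketch below computes a conservative, data-free accumulator width for a quantized dot product (the paper's contribution is a training-time constraint that makes a smaller accumulator provably safe, which is not shown here).

```python
import math

def min_accumulator_bits(dot_length, weight_bits, act_bits, signed_acts=False):
    """Conservative signed accumulator width so that a dot product of
    quantized weights and activations can never overflow."""
    max_w = 2 ** (weight_bits - 1)                          # signed weights
    max_a = 2 ** (act_bits - 1) if signed_acts else 2 ** act_bits - 1
    worst_case = dot_length * max_w * max_a                 # largest possible |sum|
    return math.ceil(math.log2(worst_case + 1)) + 1         # + sign bit

print(min_accumulator_bits(512, 8, 8))   # 25-bit accumulator for a 512-long int8 dot product
```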
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification accuracy and yields results on par with previous segmentation methods.
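The sampling-and-grouping step mentioned above is a standard point-cloud primitive; a minimal numpy sketch (farthest point sampling plus k-nearest-neighbour grouping, without the cross-attention itself) might look like this:

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedily pick n_samples points that are spread out over the cloud."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(points.shape[0]))]
    dist = np.full(points.shape[0], np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def knn_group(points, center_idx, k=16):
    """For each sampled center, gather the indices of its k nearest neighbours."""
    d = np.linalg.norm(points[None, :, :] - points[center_idx, None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]          # shape (n_centers, k)

pts = np.random.randn(2048, 3)
centers = farthest_point_sampling(pts, 256)
groups = knn_group(pts, centers)                 # local neighbourhoods for attention
```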
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision copy of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
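A minimal PyTorch sketch of that trade-off (not Mesa's actual implementation, which uses head-wise quantization parameters and running statistics): the forward result is exact, but only an 8-bit copy of the activation is kept for the backward pass, so the weight gradient is approximate.

```python
import torch

class Int8SavedMatmul(torch.autograd.Function):
    """out = x @ weight with an exact forward pass; x is saved for the
    backward pass in int8 to cut activation memory."""

    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight                                  # exact forward computation
        scale = max(float(x.abs().max()), 1e-8) / 127.0   # per-tensor scale (simplified)
        ctx.save_for_backward(torch.round(x / scale).to(torch.int8), weight)
        ctx.scale = scale
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_q, weight = ctx.saved_tensors
        x_hat = x_q.float() * ctx.scale                   # dequantized approximation of x
        return grad_out @ weight.t(), x_hat.t() @ grad_out

# usage (2-D tensors assumed): y = Int8SavedMatmul.apply(x, w)
```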
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
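The patch-by-patch idea can be shown with a toy 3x3 convolution in numpy: only one small, halo-padded tile is resident at a time instead of the full high-resolution activation (MCUNetV2 additionally redistributes receptive fields and co-searches the architecture, which is not shown here).

```python
import numpy as np

def conv3x3_valid(x, kernel):
    """Naive 'valid' 3x3 cross-correlation on a 2-D array."""
    H, W = x.shape
    out = np.empty((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    return out

def conv3x3_by_patches(x, kernel, patch=32):
    """Same result, computed tile by tile with a 1-pixel halo, so the peak
    working set is one padded tile rather than the whole image."""
    H, W = x.shape
    out = np.empty((H - 2, W - 2))
    for i0 in range(0, H - 2, patch):
        for j0 in range(0, W - 2, patch):
            i1, j1 = min(i0 + patch, H - 2), min(j0 + patch, W - 2)
            out[i0:i1, j0:j1] = conv3x3_valid(x[i0:i1 + 2, j0:j1 + 2], kernel)
    return out

x, k = np.random.randn(128, 128), np.random.randn(3, 3)
assert np.allclose(conv3x3_by_patches(x, k), conv3x3_valid(x, k))
```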
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory and achieve a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z)
- An introduction to distributed training of deep neural networks for segmentation tasks with large seismic datasets [0.0]
This paper illustrates how to tackle the two main issues of training large neural networks: memory limitations and impracticably long training times.
We show how over 750GB of data can be used to train a model by using a data generator approach which keeps in memory only the data required for the current training batch.
Furthermore, efficient training of large models is illustrated through the training of a 7-layer UNet with input dimensions of 4096×4096.
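A minimal sketch of such a data-generator approach (the file name, dtype, and array shape are hypothetical): the volume stays on disk via a memory map, and only the samples of the current batch are materialized in RAM.

```python
import numpy as np

def batch_generator(path, n_samples, sample_shape, batch_size, seed=0):
    """Yield training batches from a large on-disk array without ever
    loading the full dataset into memory."""
    rng = np.random.default_rng(seed)
    data = np.memmap(path, dtype=np.float32, mode="r",
                     shape=(n_samples,) + sample_shape)
    while True:
        idx = np.sort(rng.choice(n_samples, size=batch_size, replace=False))
        yield np.asarray(data[idx])            # copies only this batch into RAM

# e.g. batches = batch_generator("seismic_patches.dat", 100_000, (128, 128), 32)
```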
arXiv Detail & Related papers (2021-02-25T17:06:00Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall and significantly reduces the resulting energy consumption and CO2 emissions.
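A loose sketch of the slicing idea for a single dense layer (the paper's exact slice selection, rescaling, and test-time correction differ; keep_frac and the layer shapes here are illustrative): dropping a contiguous block of units lets the remaining computation run as one smaller, dense matmul instead of a masked full-size one.

```python
import numpy as np

def sliceout_linear(x, W, b, keep_frac=0.75, train=True, seed=None):
    """Dropout-like layer that keeps a random contiguous slice of output
    units during training and runs a smaller dense matmul on it."""
    if not train:
        return x @ W + b                          # full layer at test time
    rng = np.random.default_rng(seed)
    n_out = W.shape[1]
    keep = max(1, int(round(keep_frac * n_out)))
    start = int(rng.integers(0, n_out - keep + 1))
    sl = slice(start, start + keep)
    out = np.zeros((x.shape[0], n_out))
    # only the kept slice is computed; scale it up, inverted-dropout style
    out[:, sl] = (x @ W[:, sl] + b[sl]) / keep_frac
    return out

x, W, b = np.random.randn(8, 64), np.random.randn(64, 256), np.zeros(256)
y = sliceout_linear(x, W, b)                      # 25% of the output units skipped
```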
arXiv Detail & Related papers (2020-07-21T15:59:09Z)