Bitformer: An efficient Transformer with bitwise operation-based
attention for Big Data Analytics at low-cost low-precision devices
- URL: http://arxiv.org/abs/2311.13502v1
- Date: Wed, 22 Nov 2023 16:20:24 GMT
- Title: Bitformer: An efficient Transformer with bitwise operation-based
attention for Big Data Analytics at low-cost low-precision devices
- Authors: Gaoxiang Duan and Junkai Zhang and Xiaoying Zheng and Yongxin Zhu
- Abstract summary: We introduce the Bitformer model, which features a novel attention mechanism that adeptly replaces conventional floating-point matrix multiplication with bitwise operations.
The transition from an $O(n^2d)$ complexity, typical of floating-point operations, to an $O(n^2T)$ complexity characterizing bitwise operations, substantiates this advantage.
- Score: 2.484958184370265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the current landscape of large models, the Transformer stands as a
cornerstone, playing a pivotal role in shaping the trajectory of modern models.
However, its application encounters challenges attributed to the substantial
computational intricacies intrinsic to its attention mechanism. Moreover, its
reliance on high-precision floating-point operations presents specific hurdles,
particularly evident in computation-intensive scenarios such as edge computing
environments. These environments, characterized by resource-constrained devices
and a preference for lower precision, necessitate innovative solutions.
To tackle the exacting data processing demands posed by edge devices, we
introduce the Bitformer model, an inventive extension of the Transformer
paradigm. Central to this innovation is a novel attention mechanism that
adeptly replaces conventional floating-point matrix multiplication with bitwise
operations. This strategic substitution yields dual advantages. Not only does
it maintain the attention mechanism's prowess in capturing intricate long-range
information dependencies, but it also orchestrates a profound reduction in the
computational complexity inherent in the attention operation. The transition
from an $O(n^2d)$ complexity, typical of floating-point operations, to an
$O(n^2T)$ complexity characterizing bitwise operations, substantiates this
advantage. Notably, in this context, the parameter $T$ remains markedly smaller
than the conventional dimensionality parameter $d$.
The Bitformer model, in essence, endeavors to reconcile the demanding
requirements of modern computing landscapes with the constraints posed by edge
computing scenarios. By forging this innovative path, we bridge the gap between
high-performing models and resource-scarce environments, thus unveiling a
promising trajectory for further advancements in the field.
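The abstract does not spell out the exact bitwise kernel, so the following is only a minimal illustrative sketch of the general idea: sign-binarize the queries and keys, pack each d-dimensional vector into T machine words, and score query-key pairs with XOR plus popcount instead of a floating-point dot product. All names below (binarize_and_pack, bitwise_attention) are hypothetical rather than taken from the paper; the point is that each of the $n^2$ pair scores then touches $T$ packed words instead of $d$ floats, which is where the $O(n^2d)$ to $O(n^2T)$ reduction comes from.

import numpy as np

def binarize_and_pack(x):
    # Sign-binarize an (n, d) float matrix and pack the bits into uint64
    # words; the result has shape (n, T) with T = ceil(d / 64).
    bits = (x > 0).astype(np.uint8)                       # (n, d) in {0, 1}
    n, d = bits.shape
    pad = (-d) % 64
    if pad:
        bits = np.concatenate([bits, np.zeros((n, pad), np.uint8)], axis=1)
    packed = np.packbits(bits, axis=1).view(np.uint64)    # (n, T)
    return packed, d

def bitwise_attention(q, k, v):
    # Hypothetical attention scoring: XOR + popcount similarity in place of
    # the floating-point dot product Q K^T.  q, k: (n, d), v: (n, d_v).
    qb, d = binarize_and_pack(q)
    kb, _ = binarize_and_pack(k)
    xored = qb[:, None, :] ^ kb[None, :, :]               # (n, n, T) uint64
    # Popcount via unpackbits: number of differing bits per (query, key) pair.
    hamming = np.unpackbits(xored.view(np.uint8), axis=-1).sum(axis=-1)
    scores = (d - 2.0 * hamming) / np.sqrt(d)             # matching minus differing bits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v

# Toy usage: 8 tokens, d = 256 bits -> T = 4 uint64 words per token.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 256))
print(bitwise_attention(q, k, v).shape)                   # (8, 256)

On a real low-precision device the XOR/popcount loop would be fused into a hardware kernel; the NumPy version above mirrors only the arithmetic of such a scheme, not its memory layout or throughput.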
Related papers
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between efficiency and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z) - Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can exhibit human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as some of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Adaptive Point Transformer [88.28498667506165]
Adaptive Point Cloud Transformer (AdaPT) is a standard PT model augmented by an adaptive token selection mechanism.
AdaPT dynamically reduces the number of tokens during inference, enabling efficient processing of large point clouds.
arXiv Detail & Related papers (2024-01-26T13:24:45Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - RegFormer: An Efficient Projection-Aware Transformer Network for
Large-Scale Point Cloud Registration [73.69415797389195]
We propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment.
Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers.
Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes.
arXiv Detail & Related papers (2023-03-22T08:47:37Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and
Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, which, however, is computationally expensive.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - NN-LUT: Neural Approximation of Non-Linear Operations for Efficient
Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
arXiv Detail & Related papers (2021-12-03T23:06:57Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network
Design in FPGA-based Systems [4.2612881037640085]
This paper analyzes the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs.
We propose a novel Posit to fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs.
arXiv Detail & Related papers (2020-10-24T11:02:25Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)