Bitformer: An efficient Transformer with bitwise operation-based
attention for Big Data Analytics at low-cost low-precision devices
- URL: http://arxiv.org/abs/2311.13502v1
- Date: Wed, 22 Nov 2023 16:20:24 GMT
- Title: Bitformer: An efficient Transformer with bitwise operation-based
attention for Big Data Analytics at low-cost low-precision devices
- Authors: Gaoxiang Duan and Junkai Zhang and Xiaoying Zheng and Yongxin Zhu
- Abstract summary: We introduce the Bitformer model, which features a novel attention mechanism that adeptly replaces conventional floating-point matrix multiplication with bitwise operations.
The transition from an $O(n^2d)$ complexity, typical of floating-point operations, to an $O(n^2T)$ complexity characterizing bitwise operations, substantiates this advantage.
- Score: 2.484958184370265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the current landscape of large models, the Transformer stands as a
cornerstone, playing a pivotal role in shaping the trajectory of modern models.
However, its application encounters challenges attributed to the substantial
computational intricacies intrinsic to its attention mechanism. Moreover, its
reliance on high-precision floating-point operations presents specific hurdles,
particularly evident in computation-intensive scenarios such as edge computing
environments. These environments, characterized by resource-constrained devices
and a preference for lower precision, necessitate innovative solutions.
To tackle the exacting data processing demands posed by edge devices, we
introduce the Bitformer model, an inventive extension of the Transformer
paradigm. Central to this innovation is a novel attention mechanism that
adeptly replaces conventional floating-point matrix multiplication with bitwise
operations. This strategic substitution yields dual advantages. Not only does
it maintain the attention mechanism's prowess in capturing intricate long-range
information dependencies, but it also orchestrates a profound reduction in the
computational complexity inherent in the attention operation. The transition
from an $O(n^2d)$ complexity, typical of floating-point operations, to an
$O(n^2T)$ complexity characterizing bitwise operations, substantiates this
advantage. Notably, in this context, the parameter $T$ remains markedly smaller
than the conventional dimensionality parameter $d$.
The Bitformer model, in essence, endeavors to reconcile the demanding
requirements of modern computing landscapes with the constraints posed by edge
computing scenarios. By forging this innovative path, we bridge the gap between
high-performing models and resource-scarce environments, thus unveiling a
promising trajectory for further advancements in the field.
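The abstract does not spell out the exact bitwise kernel, so the following is only a minimal illustrative sketch of the general idea: sign-binarize the queries and keys, pack each d-dimensional vector into T machine words, and score query-key pairs with XOR plus popcount instead of a floating-point dot product. All names below (binarize_and_pack, bitwise_attention) are hypothetical rather than taken from the paper; the point is that each of the $n^2$ pair scores then touches $T$ packed words instead of $d$ floats, which is where the $O(n^2d)$ to $O(n^2T)$ reduction comes from.

import numpy as np

def binarize_and_pack(x):
    # Sign-binarize an (n, d) float matrix and pack the bits into uint64
    # words; the result has shape (n, T) with T = ceil(d / 64).
    bits = (x > 0).astype(np.uint8)                       # (n, d) in {0, 1}
    n, d = bits.shape
    pad = (-d) % 64
    if pad:
        bits = np.concatenate([bits, np.zeros((n, pad), np.uint8)], axis=1)
    packed = np.packbits(bits, axis=1).view(np.uint64)    # (n, T)
    return packed, d

def bitwise_attention(q, k, v):
    # Hypothetical attention scoring: XOR + popcount similarity in place of
    # the floating-point dot product Q K^T.  q, k: (n, d), v: (n, d_v).
    qb, d = binarize_and_pack(q)
    kb, _ = binarize_and_pack(k)
    xored = qb[:, None, :] ^ kb[None, :, :]               # (n, n, T) uint64
    # Popcount via unpackbits: number of differing bits per (query, key) pair.
    hamming = np.unpackbits(xored.view(np.uint8), axis=-1).sum(axis=-1)
    scores = (d - 2.0 * hamming) / np.sqrt(d)             # matching minus differing bits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v

# Toy usage: 8 tokens, d = 256 bits -> T = 4 uint64 words per token.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 256))
print(bitwise_attention(q, k, v).shape)                   # (8, 256)

On a real low-precision device the XOR/popcount loop would be fused into a hardware kernel; the NumPy version above mirrors only the arithmetic of such a scheme, not its memory layout or throughput.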
Related papers
- PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture [46.266960248570086]
This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear-complexity local attention mechanism for effective feature aggregation.
We also introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel.
We show that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between efficiency and accuracy.
arXiv Detail & Related papers (2024-08-10T10:16:03Z) - Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can exhibit human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z) - Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as some of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z) - Adaptive Point Transformer [88.28498667506165]
Adaptive Point Cloud Transformer (AdaPT) is a standard PT model augmented by an adaptive token selection mechanism.
AdaPT dynamically reduces the number of tokens during inference, enabling efficient processing of large point clouds.
arXiv Detail & Related papers (2024-01-26T13:24:45Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - RegFormer: An Efficient Projection-Aware Transformer Network for
Large-Scale Point Cloud Registration [73.69415797389195]
We propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment.
Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers.
Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes.
arXiv Detail & Related papers (2023-03-22T08:47:37Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and
Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity thanks to the self-attention mechanism, which, however, is computationally expensive.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - NN-LUT: Neural Approximation of Non-Linear Operations for Efficient
Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
arXiv Detail & Related papers (2021-12-03T23:06:57Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network
Design in FPGA-based Systems [4.2612881037640085]
This paper analyzes the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs.
We propose a novel Posit to fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs.
arXiv Detail & Related papers (2020-10-24T11:02:25Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)