PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
- URL: http://arxiv.org/abs/2110.02442v4
- Date: Mon, 22 May 2023 09:07:17 GMT
- Title: PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
- Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng,
Zhen-Hua Ling
- Abstract summary: We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity.
On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy.
- Score: 34.657602765639375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved great success in various NLP, vision,
and speech tasks. However, the core of Transformer, the self-attention
mechanism, has a quadratic time and memory complexity with respect to the
sequence length, which hinders applications of Transformer-based models to long
sequences. Many approaches have been proposed to mitigate this problem, such as
sparse attention mechanisms, low-rank matrix approximations and scalable
kernels, and token mixing alternatives to self-attention. We propose a novel
Pooling Network (PoNet) for token mixing in long sequences with linear
complexity. We design multi-granularity pooling and pooling fusion to capture
different levels of contextual information and combine their interactions with
tokens. On the Long Range Arena benchmark, PoNet significantly outperforms
Transformer and achieves competitive accuracy, while being only slightly slower
than the fastest model, FNet, across all sequence lengths measured on GPUs. We
also conduct systematic studies on the transfer learning capability of PoNet
and observe that PoNet achieves 95.7% of the accuracy of BERT on the GLUE
benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis
demonstrates the effectiveness of the designed multi-granularity pooling and
pooling fusion for token mixing in long sequences, and the efficacy of the
designed pre-training tasks for PoNet to learn transferable contextualized
language representations.
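The abstract leaves the pooling details open, but the idea lends itself to a short sketch: pool the sequence at several granularities (global, segment, local), project each pooled context, and fuse the results back into every token. The PyTorch module below is an assumed reading, not the paper's exact architecture; mean pooling for the global branch, max pooling for segments and sliding windows, and product-based fusion for the global context are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityPooling(nn.Module):
    """Hypothetical linear-complexity token mixer in the spirit of PoNet:
    pool at several granularities, then fuse the pooled context back into
    every token (details here are illustrative, not the paper's)."""

    def __init__(self, d_model: int, segment: int = 32, window: int = 3):
        super().__init__()
        self.segment = segment
        self.window = window
        # Separate projections per granularity, as one plausible design.
        self.proj_g = nn.Linear(d_model, d_model)  # global branch
        self.proj_s = nn.Linear(d_model, d_model)  # segment branch
        self.proj_l = nn.Linear(d_model, d_model)  # local branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape

        # Global context: one mean-pooled vector per sequence.
        g = self.proj_g(x.mean(dim=1, keepdim=True))            # (b, 1, d)

        # Segment context: max-pool over fixed-size segments, then
        # broadcast each segment's vector back to its tokens.
        pad = (-n) % self.segment
        xs = F.pad(x, (0, 0, 0, pad)).view(b, -1, self.segment, d)
        s = self.proj_s(xs.max(dim=2).values)                   # (b, n_seg, d)
        s = s.repeat_interleave(self.segment, dim=1)[:, :n]     # (b, n, d)

        # Local context: sliding-window max pooling along the sequence.
        l = F.max_pool1d(x.transpose(1, 2), self.window, stride=1,
                         padding=self.window // 2).transpose(1, 2)
        l = self.proj_l(l)                                      # (b, n, d)

        # Fusion: multiply tokens with the global vector so tokens stay
        # distinguishable, and add the segment and local contexts.
        return x * g + s + l

tokens = torch.randn(2, 128, 64)
print(MultiGranularityPooling(64)(tokens).shape)  # torch.Size([2, 128, 64])
```

Every branch costs O(n) in the sequence length n, which is what makes this family of mixers linear where self-attention is quadratic.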
Related papers
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- TFDMNet: A Novel Network Structure Combines the Time Domain and Frequency Domain Features [34.91485245048524]
This paper proposes a novel Element-wise Multiplication Layer (EML) to replace convolution layers.
We also introduce a Weight Fixation mechanism to alleviate the problem of over-fitting.
Experimental results show that TFDMNet achieves good performance on the MNIST, CIFAR-10, and ImageNet databases.
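The abstract does not define the EML, but the convolution theorem suggests one natural reading: convolution in the time domain equals element-wise multiplication in the frequency domain. A minimal sketch under that assumption (the class name, shapes, and initialization are illustrative, not the paper's layer):

```python
import torch
import torch.nn as nn

class ElementwiseMultiplicationLayer(nn.Module):
    """Illustrative stand-in for an EML: learn a complex-valued mask and
    apply it by element-wise multiplication in the frequency domain,
    which corresponds to a (circular) convolution in the time domain."""

    def __init__(self, channels: int, size: int):
        super().__init__()
        # One complex weight per channel and rFFT frequency bin.
        self.weight = nn.Parameter(
            torch.randn(channels, size // 2 + 1, dtype=torch.cfloat) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, size)
        spec = torch.fft.rfft(x, dim=-1)   # to the frequency domain
        spec = spec * self.weight          # element-wise "convolution"
        return torch.fft.irfft(spec, n=x.size(-1), dim=-1)

x = torch.randn(4, 8, 32)
print(ElementwiseMultiplicationLayer(8, 32)(x).shape)  # torch.Size([4, 8, 32])
```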
arXiv Detail & Related papers (2024-01-29T08:18:21Z)
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of a deformable CNN and a query-based Transformer with shared gating for multi-task learning of dense prediction.
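One hedged illustration of shared gating between a convolutional stream and a query-based attention stream follows; a plain convolution stands in for the deformable one, and the sigmoid-gate blend is an assumption, not the paper's module:

```python
import torch
import torch.nn as nn

class GatedConvAttnFusion(nn.Module):
    """Illustrative shared-gating fusion of a convolutional stream and a
    self-attention stream (a plain conv stands in for deformable conv)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local features
        a, _ = self.attn(x, x, x)                         # global features
        g = self.gate(torch.cat([c, a], dim=-1))          # per-token gate
        return g * c + (1.0 - g) * a                      # gated blend

x = torch.randn(2, 50, 64)
print(GatedConvAttnFusion(64)(x).shape)  # torch.Size([2, 50, 64])
```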
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
- MogaNet: Multi-order Gated Aggregation Network [64.16774341908365]
We propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning.
MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module.
MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet.
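A hedged sketch of what a "convolutions plus gated aggregation" module can look like: multi-order context is approximated here with depthwise convolutions at growing dilations, modulated by a learned gate. Names and hyperparameters are illustrative, not MogaNet's actual block:

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Illustrative gated-aggregation block: aggregate context with
    depthwise convolutions at several dilations (multi-order context)
    and modulate it with a learned gate, all convolutional and compact."""

    def __init__(self, channels: int):
        super().__init__()
        dw = lambda d: nn.Conv2d(channels, channels, 3, padding=d,
                                 dilation=d, groups=channels)
        self.ctx1, self.ctx2, self.ctx3 = dw(1), dw(2), dw(3)
        self.gate = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-order context: sum of depthwise convs with growing dilation.
        ctx = self.ctx1(x) + self.ctx2(x) + self.ctx3(x)
        # Gate the aggregated context, then project back.
        return self.proj(self.act(self.gate(x)) * ctx)

x = torch.randn(2, 32, 16, 16)
print(GatedAggregation(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```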
arXiv Detail & Related papers (2022-11-07T04:31:17Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and the compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, the computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
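The reward shaping is only alluded to above; a toy, assumed form that trades accuracy against latency might look like this (the function and its weighting are hypothetical, not the paper's exact design):

```python
def co_inference_reward(accuracy: float, latency_ms: float,
                        lat_weight: float = 0.01) -> float:
    """Hypothetical latency- and accuracy-aware reward for the DRL agent:
    reward correct predictions, penalize end-to-end latency."""
    return accuracy - lat_weight * latency_ms

# e.g. 92% accuracy at 80 ms end-to-end latency
print(co_inference_reward(0.92, 80.0))  # ~0.12
```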
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic sliceable network (DS-Net++), which input-dependently adjust the filter numbers of CNNs and multiple dimensions in both CNNs and Transformers.
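A minimal sketch of dynamic weight slicing on a single linear layer: a small gate estimates input difficulty and picks how large a slice of the weight matrix to run. The gate design and the hard batch-level choice are simplifications, not the paper's routing mechanism:

```python
import torch
import torch.nn as nn

class SliceableLinear(nn.Module):
    """Illustrative dynamic weight slicing: a tiny gate inspects the input
    and picks how many output units (a slice of the weight matrix) to use,
    so easy inputs run a thinner, cheaper sub-network."""

    def __init__(self, in_dim: int, out_dim: int, ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.ratios = ratios
        self.gate = nn.Linear(in_dim, len(ratios))  # difficulty predictor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pick one width for the whole batch (hard argmax for simplicity;
        # training would need a differentiable or RL-based gate).
        choice = self.gate(x.mean(dim=0)).argmax().item()
        k = max(1, int(self.weight.size(0) * self.ratios[choice]))
        out = x @ self.weight[:k].t() + self.bias[:k]  # sliced weights
        # Zero-pad so downstream shapes stay fixed in this toy sketch.
        return nn.functional.pad(out, (0, self.weight.size(0) - k))

x = torch.randn(8, 16)
print(SliceableLinear(16, 32)(x).shape)  # torch.Size([8, 32])
```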
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
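The mixing sublayer itself is simple to state: apply a Fourier transform along the hidden dimension, then along the sequence dimension, and keep the real part. A minimal PyTorch rendering:

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style token mixing: replace self-attention with a 2D Fourier
    transform (over sequence and hidden dims) and keep the real part.
    No learned parameters are involved in the mixing itself."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 128, 64)
print(FourierMixing()(x).shape)  # torch.Size([2, 128, 64])
```

Because the transform has no learned parameters, the sublayer is fast and memory-light, which is consistent with the speed comparison against PoNet above.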
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- MSCFNet: A Lightweight Network With Multi-Scale Context Fusion for Real-Time Semantic Segmentation [27.232578592161673]
We devise a novel lightweight network using a multi-scale context fusion scheme (MSCFNet).
The proposed MSCFNet contains only 1.15M parameters, achieves 71.9% Mean IoU and can run at over 50 FPS on a single Titan XP GPU configuration.
arXiv Detail & Related papers (2021-03-24T08:28:26Z)
- Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences [3.8848561367220276]
We present a simple and lightweight variant of the Shuffle-Exchange network, which is based on a residual network employing GELU and Layer Normalization.
The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy.
It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription.
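A hedged sketch of the two named ingredients: the perfect-shuffle permutation that brings distant positions next to each other, and a residual switch unit built from a linear pair-mixing map, LayerNorm, and GELU. The switch parameterization is illustrative, not the paper's exact Residual Switch Unit:

```python
import torch
import torch.nn as nn

def perfect_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Riffle-shuffle the sequence: interleave the two halves
    (positions 0..n/2-1 with n/2..n-1). Sequence length must be even."""
    b, n, d = x.shape
    return torch.stack((x[:, : n // 2], x[:, n // 2 :]), dim=2).reshape(b, n, d)

class ResidualSwitchLayer(nn.Module):
    """Illustrative residual switch unit in the spirit of the paper:
    mix each adjacent pair of elements with a linear map, then apply
    LayerNorm and GELU, with a residual connection around the block."""

    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(2 * dim, 2 * dim)
        self.norm = nn.LayerNorm(2 * dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        pairs = x.reshape(b, n // 2, 2 * d)         # adjacent pairs
        out = self.act(self.norm(self.mix(pairs)))  # switch operation
        return x + out.reshape(b, n, d)             # residual connection

x = torch.randn(2, 16, 8)
x = ResidualSwitchLayer(8)(perfect_shuffle(x))
print(x.shape)  # torch.Size([2, 16, 8])
```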
arXiv Detail & Related papers (2020-04-06T12:44:22Z)