Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences
- URL: http://arxiv.org/abs/2004.04662v4
- Date: Fri, 15 Jan 2021 00:33:19 GMT
- Title: Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences
- Authors: Andis Draguns, Emīls Ozoliņš, Agris Šostaks, Matīss Apinis, Kārlis Freivalds
- Abstract summary: We present a simple and lightweight variant of the Shuffle-Exchange network, which is based on a residual network employing GELU and Layer Normalization.
The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy.
It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription.
- Score: 3.8848561367220276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention is a commonly used mechanism in sequence processing, but it is of
O(n^2) complexity which prevents its application to long sequences. The
recently introduced neural Shuffle-Exchange network offers a
computation-efficient alternative, enabling the modelling of long-range
dependencies in O(n log n) time. The model, however, is quite complex,
involving a sophisticated gating mechanism derived from the Gated Recurrent
Unit. In this paper, we present a simple and lightweight variant of the
Shuffle-Exchange network, which is based on a residual network employing GELU
and Layer Normalization. The proposed architecture not only scales to longer
sequences but also converges faster and provides better accuracy. It surpasses
the Shuffle-Exchange network on the LAMBADA language modelling task and
achieves state-of-the-art performance on the MusicNet dataset for music
transcription while being efficient in the number of parameters. We show how to
combine the improved Shuffle-Exchange network with convolutional layers,
establishing it as a useful building block in long sequence processing
applications.
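
For intuition, below is a minimal PyTorch sketch of the architecture described in the abstract: a residual "switch" unit built from Layer Normalization and GELU mixes adjacent pairs of elements, and a perfect-shuffle permutation rearranges the sequence between switch layers, so roughly log2(n) layers over n elements give O(n log n) work. The class names, hidden widths, depth, and exact residual form are illustrative assumptions, not the authors' reference implementation.

```python
import math

import torch
import torch.nn as nn


class ResidualSwitchUnit(nn.Module):
    """Mixes each adjacent pair of sequence elements with a
    LayerNorm -> Linear -> GELU -> Linear block plus a residual connection
    (an assumed simplification of the GRU-style gating in the original
    Shuffle-Exchange network)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.ff = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, 2 * dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim), n must be even
        b, n, d = x.shape
        pairs = x.reshape(b, n // 2, 2 * d)        # group adjacent elements
        pairs = pairs + self.ff(self.norm(pairs))  # residual switch operation
        return pairs.reshape(b, n, d)


def perfect_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Riffle-shuffles the sequence: interleaves its first and second halves,
    routing every element to a new partner for the next switch layer."""
    b, n, d = x.shape
    first, second = x[:, : n // 2], x[:, n // 2 :]
    return torch.stack((first, second), dim=2).reshape(b, n, d)


class ResidualShuffleExchange(nn.Module):
    """About log2(n) switch layers interleaved with perfect shuffles, giving
    O(n log n) total work instead of the O(n^2) of full self-attention."""

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        depth = max(1, int(math.log2(seq_len)))
        self.switches = nn.ModuleList(
            [ResidualSwitchUnit(dim) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for switch in self.switches:
            x = perfect_shuffle(switch(x))
        return x


if __name__ == "__main__":
    model = ResidualShuffleExchange(dim=32, seq_len=1024)  # 1024 = 2**10
    tokens = torch.randn(2, 1024, 32)                      # batch of 2 sequences
    print(model(tokens).shape)                             # torch.Size([2, 1024, 32])
```

The perfect shuffle repeatedly changes which elements are paired, so after about log2(n) switch layers information from any position can reach any other, which is the source of the O(n log n) scaling compared with O(n^2) full attention.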
Related papers
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ that compresses the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps mitigate the difficulty of capturing long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the Transformer architecture, creating an efficient incremental inference algorithm whose complexity is proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Sequence Modeling with Multiresolution Convolutional Memory [27.218134279968062]
We introduce a new building block for sequence modeling called the MultiresLayer.
The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence.
Our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks.
arXiv Detail & Related papers (2023-05-02T17:50:54Z)
- AEGNN: Asynchronous Event-based Graph Neural Networks [54.528926463775946]
Event-based Graph Neural Networks generalize standard GNNs to process events as "evolving" spatio-temporal graphs.
AEGNNs are easily trained on synchronous inputs and can be converted into efficient "asynchronous" networks at test time.
arXiv Detail & Related papers (2022-03-31T16:21:12Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures.
We propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker embedding network with almost no increase in computational cost.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- Efficient Long Sequence Encoding via Synchronization [29.075962393432857]
We propose a synchronization mechanism for hierarchical encoding.
Our approach first identifies anchor tokens across segments and groups them by their roles in the original input sequence.
This improves global information exchange among segments while maintaining efficiency.
arXiv Detail & Related papers (2022-03-15T04:37:02Z)
- Deep Explicit Duration Switching Models for Time Series [84.33678003781908]
We propose a flexible model that is capable of identifying both state- and time-dependent switching dynamics.
State-dependent switching is enabled by a recurrent state-to-switch connection.
An explicit duration count variable is used to improve the time-dependent switching behavior.
arXiv Detail & Related papers (2021-10-26T17:35:21Z)
- PoNet: Pooling Network for Efficient Token Mixing in Long Sequences [34.657602765639375]
We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity.
On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy.
arXiv Detail & Related papers (2021-10-06T01:07:54Z)
- Oscillatory Fourier Neural Network: A Compact and Efficient Architecture for Sequential Processing [16.69710555668727]
We propose a novel neuron model that has a cosine activation with a time-varying component for sequential processing.
The proposed neuron provides an efficient building block for projecting sequential inputs into spectral domain.
Applying the proposed model to sentiment analysis on the IMDB dataset reaches 89.4% test accuracy within 5 epochs.
arXiv Detail & Related papers (2021-09-14T19:08:07Z)
- ShuffleBlock: Shuffle to Regularize Deep Convolutional Neural Networks [35.67192058479252]
This paper studies the operation of channel shuffle as a regularization technique in deep convolutional networks.
We show that while random shuffling of channels during training drastically reduces performance, randomly shuffling small patches significantly improves it.
The ShuffleBlock module is easy to implement and improves the performance of several baseline networks on the task of image classification on CIFAR and ImageNet datasets.
arXiv Detail & Related papers (2021-06-17T10:23:00Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer that performs attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)