SepTr: Separable Transformer for Audio Spectrogram Processing
- URL: http://arxiv.org/abs/2203.09581v1
- Date: Thu, 17 Mar 2022 19:48:43 GMT
- Title: SepTr: Separable Transformer for Audio Spectrogram Processing
- Authors: Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan
- Abstract summary: We propose a new vision transformer architecture called Separable Transformer (SepTr)
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
- Score: 74.41172054754928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the successful application of vision transformers in multiple
computer vision tasks, these models have drawn the attention of the signal
processing community. This is because signals are often represented as
spectrograms (e.g. through Discrete Fourier Transform) which can be directly
provided as input to vision transformers. However, naively applying
transformers to spectrograms is suboptimal. Since the axes represent distinct
dimensions, i.e. frequency and time, we argue that a better approach is to
separate the attention dedicated to each axis. To this end, we propose the
Separable Transformer (SepTr), an architecture that employs two transformer
blocks in a sequential manner, the first attending to tokens within the same
frequency bin, and the second attending to tokens within the same time
interval. We conduct experiments on three benchmark data sets, showing that our
separable architecture outperforms conventional vision transformers and other
state-of-the-art methods. Unlike standard transformers, SepTr linearly scales
the number of trainable parameters with the input size, thus having a lower
memory footprint. Our code is available as open source at
https://github.com/ristea/septr.
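The abstract above is specific enough to sketch the mechanism. Below is a minimal, hypothetical PyTorch illustration of the separable, axis-wise attention it describes: one block mixes tokens that share a frequency bin (attention along time), and a second block mixes tokens that share a time interval (attention along frequency). It is not the authors' implementation (see the repository linked above for that); the class names, the `(batch, freq_bins, time_steps, dim)` token layout, and all hyperparameters are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AxisTransformerBlock(nn.Module):
    """Standard pre-norm transformer block applied along one axis of the token grid."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):  # x: (groups, sequence, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class SeparableAttentionSketch(nn.Module):
    """Two sequential blocks: one over time (tokens sharing a frequency bin),
    then one over frequency (tokens sharing a time interval)."""
    def __init__(self, dim):
        super().__init__()
        self.time_block = AxisTransformerBlock(dim)  # attends within a frequency bin
        self.freq_block = AxisTransformerBlock(dim)  # attends within a time interval

    def forward(self, x):  # x: (batch, freq_bins, time_steps, dim)
        b, f, t, d = x.shape
        # 1) attention among tokens that share a frequency bin (i.e. along the time axis)
        x = self.time_block(x.reshape(b * f, t, d)).reshape(b, f, t, d)
        # 2) attention among tokens that share a time interval (i.e. along the frequency axis)
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.freq_block(x).reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x

tokens = torch.randn(2, 64, 128, 96)        # (batch, freq bins, time steps, embed dim)
out = SeparableAttentionSketch(96)(tokens)  # same shape as the input
```

Because each attention call sees a sequence of length `time_steps` or `freq_bins` rather than their product, the quadratic attention cost is paid per axis only, which is consistent with the lower memory footprint claimed in the abstract.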
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (a minimal sketch of this inversion is given after this list).
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Machine Learning for Brain Disorders: Transformers and Visual
Transformers [4.186575888568896]
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision.
We introduce the Attention mechanism (Section 1), and then the Basic Transformer Block including the Vision Transformer.
Finally, we introduce Visual Transformers applied to tasks other than image classification, such as detection, segmentation, generation and training without labels.
arXiv Detail & Related papers (2023-03-21T17:57:33Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - Boosting vision transformers for image retrieval [11.441395750267052]
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection.
However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks.
We propose a number of improvements that make transformers outperform the state of the art for the first time.
arXiv Detail & Related papers (2022-10-21T12:17:12Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full softmax-weighted attention and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image
Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear time and memory, instead of the quadratic cost of full attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)