SepTr: Separable Transformer for Audio Spectrogram Processing
- URL: http://arxiv.org/abs/2203.09581v1
- Date: Thu, 17 Mar 2022 19:48:43 GMT
- Title: SepTr: Separable Transformer for Audio Spectrogram Processing
- Authors: Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan
- Abstract summary: We propose a new vision transformer architecture called Separable Transformer (SepTr)
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
- Score: 74.41172054754928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the successful application of vision transformers in multiple
computer vision tasks, these models have drawn the attention of the signal
processing community. This is because signals are often represented as
spectrograms (e.g. through Discrete Fourier Transform) which can be directly
provided as input to vision transformers. However, naively applying
transformers to spectrograms is suboptimal. Since the axes represent distinct
dimensions, i.e. frequency and time, we argue that a better approach is to
separate the attention dedicated to each axis. To this end, we propose the
Separable Transformer (SepTr), an architecture that employs two transformer
blocks in a sequential manner, the first attending to tokens within the same
frequency bin, and the second attending to tokens within the same time
interval. We conduct experiments on three benchmark data sets, showing that our
separable architecture outperforms conventional vision transformers and other
state-of-the-art methods. Unlike standard transformers, SepTr linearly scales
the number of trainable parameters with the input size, thus having a lower
memory footprint. Our code is available as open source at
https://github.com/ristea/septr.
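The abstract above is specific enough to sketch the mechanism. Below is a minimal, hypothetical PyTorch illustration of the separable, axis-wise attention it describes: one block mixes tokens that share a frequency bin (attention along time), and a second block mixes tokens that share a time interval (attention along frequency). It is not the authors' implementation (see the repository linked above for that); the class names, the `(batch, freq_bins, time_steps, dim)` token layout, and all hyperparameters are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AxisTransformerBlock(nn.Module):
    """Standard pre-norm transformer block applied along one axis of the token grid."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):  # x: (groups, sequence, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class SeparableAttentionSketch(nn.Module):
    """Two sequential blocks: one over time (tokens sharing a frequency bin),
    then one over frequency (tokens sharing a time interval)."""
    def __init__(self, dim):
        super().__init__()
        self.time_block = AxisTransformerBlock(dim)  # attends within a frequency bin
        self.freq_block = AxisTransformerBlock(dim)  # attends within a time interval

    def forward(self, x):  # x: (batch, freq_bins, time_steps, dim)
        b, f, t, d = x.shape
        # 1) attention among tokens that share a frequency bin (i.e. along the time axis)
        x = self.time_block(x.reshape(b * f, t, d)).reshape(b, f, t, d)
        # 2) attention among tokens that share a time interval (i.e. along the frequency axis)
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.freq_block(x).reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x

tokens = torch.randn(2, 64, 128, 96)        # (batch, freq bins, time steps, embed dim)
out = SeparableAttentionSketch(96)(tokens)  # same shape as the input
```

Because each attention call sees a sequence of length `time_steps` or `freq_bins` rather than their product, the quadratic attention cost is paid per axis only, which is consistent with the lower memory footprint claimed in the abstract.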
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (a minimal sketch of this inversion is given after this list).
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Machine Learning for Brain Disorders: Transformers and Visual
Transformers [4.186575888568896]
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision.
We introduce the Attention mechanism (Section 1), and then the Basic Transformer Block including the Vision Transformer.
Finally, we introduce Visual Transformers applied to tasks other than image classification, such as detection, segmentation, generation and training without labels.
arXiv Detail & Related papers (2023-03-21T17:57:33Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - Boosting vision transformers for image retrieval [11.441395750267052]
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection.
However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks.
We propose a number of improvements that make transformers outperform the state of the art for the first time.
arXiv Detail & Related papers (2022-10-21T12:17:12Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full softmax-weighted attention and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image
Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear time and memory, instead of the quadratic cost of full attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)