Large Scale Audio Understanding without Transformers/ Convolutions/
BERTs/ Mixers/ Attention/ RNNs or ....
- URL: http://arxiv.org/abs/2110.03183v2
- Date: Fri, 8 Oct 2021 18:17:09 GMT
- Title: Large Scale Audio Understanding without Transformers/ Convolutions/
BERTs/ Mixers/ Attention/ RNNs or ....
- Authors: Prateek Verma
- Abstract summary: This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures.
Our approach does not have any convolutions, recurrence, attention, Transformers, or other approaches such as BERT.
A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on a learned representation.
- Score: 4.594159253008448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a way of doing large-scale audio understanding
without traditional state-of-the-art neural architectures. Ever since the
introduction of deep learning for understanding audio signals in the past
decade, convolutional architectures have been able to achieve state-of-the-art
results, surpassing traditional hand-crafted features. In the recent past,
there has been a similar shift away from traditional convolutional and
recurrent neural networks towards purely end-to-end Transformer architectures.
We, in this work, explore an approach based on the Bag-of-Words model. Our
approach does not have any convolutions, recurrence, attention, Transformers,
or other approaches such as BERT. We utilize micro- and macro-level clustered
vanilla embeddings and use an MLP head for classification. We only use
feed-forward encoder-decoder models to get the bottlenecks of spectral
envelopes, spectral patches, and slices, as well as multi-resolution spectra.
A classification head (a feed-forward layer), similar to the approach in
SimCLR, is trained on a learned representation. Using simple codes learned on
latent representations, we show how we surpass traditional convolutional
neural network architectures and come strikingly close to outperforming
powerful Transformer architectures. This work will hopefully pave the way for
exciting advancements in the field of representation learning without massive,
end-to-end neural architectures.
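The pipeline the abstract describes — quantizing learned patch embeddings against a codebook, pooling the discrete codes into a bag-of-words histogram, and feeding that to a feed-forward classification head — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for real encoder-decoder bottleneck embeddings and learned cluster centroids, and the codebook size, embedding dimension, and single-layer head are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(patches, codebook):
    """Assign each patch embedding to its nearest codebook centroid (a 'code')."""
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def bag_of_words(codes, vocab_size):
    """Normalized histogram of code counts -- the clip-level feature vector."""
    hist = np.bincount(codes, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# Hypothetical stand-ins for learned quantities:
patches = rng.normal(size=(200, 16))   # 200 bottleneck embeddings, dim 16
codebook = rng.normal(size=(32, 16))   # 32 clustered "vanilla" embeddings

codes = quantize(patches, codebook)    # discrete codes per patch
feat = bag_of_words(codes, 32)         # one fixed-length vector per clip

# Feed-forward classification head (here a single linear layer + softmax),
# analogous to the SimCLR-style head trained on the learned representation.
W = rng.normal(size=(32, 10))
b = np.zeros(10)
logits = feat @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In the paper's setting the codebook would come from clustering bottleneck embeddings (at micro and macro levels), and the head would be trained with a standard classification loss on these pooled features.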
Related papers
- WaveletGPT: Wavelets Meet Large Language Models [1.2328446298523066]
Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements.
This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure.
We achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music.
arXiv Detail & Related papers (2024-09-04T03:17:19Z) - DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs)
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z) - TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical
Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit hierarchical Swin-Transformer with shifted windows to extend the DeepLabv3 and model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs superior or on par with most contemporary works on an amalgamation of Vision Transformer and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z) - Adaptive Convolutional Dictionary Network for CT Metal Artifact
Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Audio Transformers: Transformer Architectures For Large Scale Audio
Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models to produce state of the art results.
We further improve the performance of Transformer architectures by using techniques, such as pooling, inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Spatio-Temporal Inception Graph Convolutional Networks for
Skeleton-Based Action Recognition [126.51241919472356]
We design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition.
Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths.
arXiv Detail & Related papers (2020-11-26T14:43:04Z) - DeepRx MIMO: Convolutional MIMO Detection with Learned Multiplicative
Transformations [7.775752249659354]
We present a deep learning-based receiver architecture that consists of a ResNet-based convolutional neural network, also known as DeepRx, combined with a so-called transformation layer, all trained together.
To the best of our knowledge, these are some of the first results showing such high performance for a fully learned receiver.
arXiv Detail & Related papers (2020-10-30T14:11:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.