Large Scale Audio Understanding without Transformers/ Convolutions/
BERTs/ Mixers/ Attention/ RNNs or ....
- URL: http://arxiv.org/abs/2110.03183v2
- Date: Fri, 8 Oct 2021 18:17:09 GMT
- Title: Large Scale Audio Understanding without Transformers/ Convolutions/
BERTs/ Mixers/ Attention/ RNNs or ....
- Authors: Prateek Verma
- Abstract summary: This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures.
Our approach does not have any convolutions, recurrence, attention, Transformers, or other approaches such as BERT.
A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on a learned representation.
- Score: 4.594159253008448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a way of doing large-scale audio understanding
without traditional state-of-the-art neural architectures. Ever since the
introduction of deep learning for understanding audio signals in the past
decade, convolutional architectures have been able to achieve state-of-the-art
results, surpassing traditional hand-crafted features. In the recent past,
there has been a similar shift away from traditional convolutional and
recurrent neural networks towards purely end-to-end Transformer architectures.
We, in this work, explore an approach based on the Bag-of-Words model. Our
approach does not have any convolutions, recurrence, attention, Transformers,
or other approaches such as BERT. We utilize micro- and macro-level clustered
vanilla embeddings and use an MLP head for classification. We only use
feed-forward encoder-decoder models to get the bottlenecks of spectral
envelopes, spectral patches, and slices, as well as multi-resolution spectra.
A classification head (a feed-forward layer), similar to the approach in
SimCLR, is trained on a learned representation. Using simple codes learned on
latent representations, we show how we surpass traditional convolutional
neural network architectures and come strikingly close to outperforming
powerful Transformer architectures. This work will hopefully pave the way for
exciting advancements in the field of representation learning without massive,
end-to-end neural architectures.
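The pipeline the abstract describes — quantizing learned patch embeddings against a codebook, pooling the discrete codes into a bag-of-words histogram, and feeding that to a feed-forward classification head — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for real encoder-decoder bottleneck embeddings and learned cluster centroids, and the codebook size, embedding dimension, and single-layer head are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(patches, codebook):
    """Assign each patch embedding to its nearest codebook centroid (a 'code')."""
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def bag_of_words(codes, vocab_size):
    """Normalized histogram of code counts -- the clip-level feature vector."""
    hist = np.bincount(codes, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# Hypothetical stand-ins for learned quantities:
patches = rng.normal(size=(200, 16))   # 200 bottleneck embeddings, dim 16
codebook = rng.normal(size=(32, 16))   # 32 clustered "vanilla" embeddings

codes = quantize(patches, codebook)    # discrete codes per patch
feat = bag_of_words(codes, 32)         # one fixed-length vector per clip

# Feed-forward classification head (here a single linear layer + softmax),
# analogous to the SimCLR-style head trained on the learned representation.
W = rng.normal(size=(32, 10))
b = np.zeros(10)
logits = feat @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In the paper's setting the codebook would come from clustering bottleneck embeddings (at micro and macro levels), and the head would be trained with a standard classification loss on these pooled features.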
Related papers
- WaveletGPT: Wavelets Meet Large Language Models [1.2328446298523066]
Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements.
This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of the structure.
We achieve the same pre-training performance almost twice as fast in text, raw audio, and symbolic music.
arXiv Detail & Related papers (2024-09-04T03:17:19Z) - DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs)
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z) - TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical
Image Segmentation [11.190117191084175]
This paper proposes TransDeepLab, a novel DeepLab-like pure Transformer for medical image segmentation.
We exploit hierarchical Swin-Transformer with shifted windows to extend the DeepLabv3 and model the Atrous Spatial Pyramid Pooling (ASPP) module.
Our approach performs superior or on par with most contemporary works on an amalgamation of Vision Transformer and CNN-based methods.
arXiv Detail & Related papers (2022-08-01T09:53:53Z) - Adaptive Convolutional Dictionary Network for CT Metal Artifact
Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Audio Transformers: Transformer Architectures For Large Scale Audio
Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models to produce state of the art results.
We further improve the performance of Transformer architectures by using techniques, such as pooling, inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Spatio-Temporal Inception Graph Convolutional Networks for
Skeleton-Based Action Recognition [126.51241919472356]
We design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition.
Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths.
arXiv Detail & Related papers (2020-11-26T14:43:04Z) - DeepRx MIMO: Convolutional MIMO Detection with Learned Multiplicative
Transformations [7.775752249659354]
We present a deep learning-based receiver architecture that consists of a ResNet-based convolutional neural network, also known as DeepRx, combined with a so-called transformation layer, all trained together.
To the best of our knowledge, these are some of the first results showing such high performance for a fully learned receiver.
arXiv Detail & Related papers (2020-10-30T14:11:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.