Audio Transformers: Transformer Architectures For Large Scale Audio
Understanding. Adieu Convolutions
- URL: http://arxiv.org/abs/2105.00335v1
- Date: Sat, 1 May 2021 19:38:30 GMT
- Title: Audio Transformers: Transformer Architectures For Large Scale Audio
Understanding. Adieu Convolutions
- Authors: Prateek Verma and Jonathan Berger
- Abstract summary: We propose applying Transformer-based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models to produce state of the art results.
We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks.
- Score: 6.370905925442655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the past two decades, CNN architectures have produced compelling models
of sound perception and cognition, learning hierarchical organizations of
features. Analogous to successes in computer vision, audio feature
classification can be optimized for a particular task of interest, over a wide
variety of datasets and labels. In fact similar architectures designed for
image understanding have proven effective for acoustic scene analysis. Here we
propose applying Transformer-based architectures without convolutional layers
to raw audio signals. On the standard Free Sound 50K dataset, comprising 200
categories, our model outperforms convolutional models to produce state of the
art results. This is significant because, unlike in natural language processing
and computer vision, we do not perform unsupervised pre-training to outperform
convolutional architectures. On the same training set, with respect to mean
average precision benchmarks, we show a significant improvement. We further
improve the performance of Transformer architectures by using techniques such
as pooling, inspired by convolutional networks designed in the past few years.
In addition, we also show how multi-rate signal processing ideas inspired by
wavelets can be applied to the Transformer embeddings to improve the results.
We also show how our model learns a non-linear, non-constant-bandwidth
filter bank, which gives an adaptable time-frequency front-end representation
for the task of audio understanding, different from that of other tasks, e.g.
pitch estimation.
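The abstract reports results as mean average precision (mAP) over the sound categories. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function names are hypothetical, ties in score are broken by input order, and skipping classes with no positive example is an assumption rather than something the abstract states.

```python
def average_precision(labels, scores):
    """Average precision for one class.

    labels: sequence of 0/1 ground-truth flags, one per clip.
    scores: sequence of model scores for this class, one per clip.
    Returns None when the class has no positive clip (assumed convention).
    """
    # Rank clips from highest to lowest score (sorted is stable, so ties
    # keep their input order).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            # Precision measured at the rank of each positive clip.
            precisions.append(hits / rank)
    if not precisions:
        return None
    return sum(precisions) / len(precisions)


def mean_average_precision(label_rows, score_rows):
    """Mean of per-class average precision.

    label_rows / score_rows: one row per clip, one column per class.
    Classes without any positive clip are skipped (assumed convention).
    """
    n_classes = len(label_rows[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision(
            [row[c] for row in label_rows],
            [row[c] for row in score_rows],
        )
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else float("nan")


if __name__ == "__main__":
    # Toy data: 4 clips, 2 classes (values are made up for illustration).
    y_true = [[1, 0], [0, 1], [1, 0], [0, 0]]
    y_score = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.1, 0.3]]
    print(mean_average_precision(y_true, y_score))
```

Because the metric averages over classes rather than over clips, rare categories weigh as much as frequent ones, which is the usual reason it is preferred for multi-label audio tagging.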
Related papers
- Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Content Adaptive Front End For Audio Signal Processing [2.8935588665357077]
We propose a learnable content adaptive front end for audio signal processing.
We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector.
arXiv Detail & Related papers (2023-03-18T16:09:10Z)
- Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers [6.002503434201551]
We study the use of audio transformers trained on large-scale datasets to learn general-purpose representations.
Our results show that representations extracted by audio transformers outperform CNN representations.
arXiv Detail & Related papers (2022-11-25T08:39:12Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- A Language Model With Million Sample Context For Raw Audio Using Transformer Architectures [2.8935588665357077]
We propose a generative auto-regressive architecture that can model audio waveforms over a large context.
Our work is adapted to learn time dependencies by first learning a latent representation with a CNN front-end, and then learning dependencies over these representations using Transformer encoders.
We achieve state-of-the-art performance compared to other approaches such as WaveNet, SaShiMi, and SampleRNN.
arXiv Detail & Related papers (2022-06-16T16:57:43Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers [52.30336730712544]
We introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance.
We propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation.
We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.
arXiv Detail & Related papers (2022-02-01T19:03:03Z)
- Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)