Related papers: Audio Transformers:Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Audio Transformers:Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

URL: http://arxiv.org/abs/2105.00335v1
Date: Sat, 1 May 2021 19:38:30 GMT
Title: Audio Transformers:Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions
Authors: Prateek Verma and Jonathan Berger
Abstract summary: We propose applying Transformer based architectures without convolutional layers to raw audio signals. Our model outperforms convolutional models to produce state of the art results. We further improve the performance of Transformer architectures by using techniques such as pooling inspired from convolutional net-work.
Score: 6.370905925442655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K,comprising of 200 categories, our model outperforms convolutional models to produce state of the art results. This is significant as unlike in natural language processing and computer vision, we do not perform unsupervised pre-training for outperforming convolutional architectures. On the same training set, with respect mean aver-age precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling inspired from convolutional net-work designed in the past few years. In addition, we also show how multi-rate signal processing ideas inspired from wavelets, can be applied to the Transformer embeddings to improve the results. We also show how our models learns a non-linear non constant band-width filter-bank, which shows an adaptable time frequency front end representation for the task of audio understanding, different from other tasks e.g. pitch estimation.

Related papers

Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders. During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
Content Adaptive Front End For Audio Signal Processing [2.8935588665357077]
We propose a learnable content adaptive front end for audio signal processing. We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector.
arXiv Detail & Related papers (2023-03-18T16:09:10Z)
Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers [6.002503434201551]
We study the use of audio transformers trained on large-scale datasets to learn general-purpose representations. Our results show that representations extracted by audio transformers outperform CNN representations.
arXiv Detail & Related papers (2022-11-25T08:39:12Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
A Language Model With Million Sample Context For Raw Audio Using Transformer Architectures [2.8935588665357077]
We propose a generative auto-regressive architecture that can model audio waveforms over a large context. Our work is adapted to learn time dependencies by learning a latent representation by a CNN front-end, and then learning dependencies over these representations using Transformer encoders. We achieve a state-of-the-art performance as compared to other approaches such as Wavenet, SaSHMI, and Sample-RNN.
arXiv Detail & Related papers (2022-06-16T16:57:43Z)
Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT) We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers [52.30336730712544]
We introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance. We propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation. We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.
arXiv Detail & Related papers (2022-02-01T19:03:03Z)
Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms. The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time. We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We focus on the scene context provided by the visual information, to ground the ASR. Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.