Study of positional encoding approaches for Audio Spectrogram Transformers
- URL: http://arxiv.org/abs/2110.06999v1
- Date: Wed, 13 Oct 2021 19:20:20 GMT
- Title: Study of positional encoding approaches for Audio Spectrogram Transformers
- Authors: Leonardo Pepino and Pablo Riera and Luciana Ferrer
- Abstract summary: In this paper, we study one component of the Audio Spectrogram Transformer (AST), the positional encoding, and propose several variants to improve its performance.
Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.
- Score: 16.829474982595837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, in order for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.
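
Conditional positional encodings of the kind used in the paper's best model are typically implemented as a positional encoding generator: a depthwise convolution applied to the patch tokens reshaped onto their time-frequency grid, as in the CPVT formulation of Chu et al. The sketch below illustrates that idea in PyTorch; the module name, dimensions, and grid size are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ConditionalPosEncoding(nn.Module):
    """PEG-style module (sketch): a depthwise convolution over the patch
    grid produces position information that depends on the input itself,
    so it generalizes to input sizes unseen during training."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (batch, num_patches, dim); grid_hw: patch-grid (H, W)
        b, n, d = tokens.shape
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.proj(x) + x          # conv-generated encoding, residual add
        return x.flatten(2).transpose(1, 2)

# Example: a hypothetical 12x101 patch grid from a spectrogram, 768-dim tokens
peg = ConditionalPosEncoding(768)
out = peg(torch.randn(2, 12 * 101, 768), (12, 101))
```

Because the encoding is generated by convolution rather than looked up in a fixed-size table, it adapts naturally to variable-duration audio, which is one reason it helps ASTs trained from scratch.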
Related papers
- ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions [15.472819870523093]
Transformer-based models, such as the Audio Spectrogram Transformers (AST), inherit the fixed-size input paradigm from CNNs.
This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference.
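One baseline way to feed variable-length audio to a transformer is to pad each batch to its longest patch sequence and mask the padding in attention; the sketch below shows that generic mechanism in PyTorch. It is an illustration of the underlying problem, not necessarily ElasticAST's own batching scheme.

```python
import torch
import torch.nn as nn

def pad_and_mask(patch_seqs):
    """Pad variable-length patch-token sequences to a common length and
    build a key-padding mask so attention ignores the padding.
    patch_seqs: list of (n_i, dim) tensors."""
    dim = patch_seqs[0].shape[1]
    max_len = max(s.shape[0] for s in patch_seqs)
    batch = torch.zeros(len(patch_seqs), max_len, dim)
    mask = torch.ones(len(patch_seqs), max_len, dtype=torch.bool)  # True = pad
    for i, s in enumerate(patch_seqs):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = False
    return batch, mask

enc = nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True)
x, pad = pad_and_mask([torch.randn(60, 192), torch.randn(101, 192)])
y = enc(x, src_key_padding_mask=pad)   # (2, 101, 192)
```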
arXiv Detail & Related papers (2024-07-11T17:29:56Z)
- From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers [16.90294414874585]
We introduce multi-phase training of audio spectrogram transformers by connecting the idea of coarse-to-fine with transformer models.
Under this scheme, the transformer model learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned with high-resolution data in a curriculum learning strategy.
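A minimal sketch of such a coarse-to-fine schedule, assuming time-axis average pooling as the coarsening operation and a toy classifier standing in for a real AST; the phase lengths and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coarsen(spec: torch.Tensor, factor: int) -> torch.Tensor:
    """Average-pool a (batch, mels, frames) spectrogram along time to
    produce the lower-resolution inputs for the early phases."""
    return spec if factor == 1 else F.avg_pool1d(spec, factor)

# Toy stand-ins so the loop runs; a real setup would use an AST and a dataset.
model = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
spec = torch.randn(8, 128, 1024)            # (batch, mels, frames)
label = torch.randint(0, 10, (8,))

# Hypothetical schedule: (time-downsampling factor, steps per phase).
for factor, steps in [(4, 100), (2, 100), (1, 300)]:
    for _ in range(steps):
        loss = F.cross_entropy(model(coarsen(spec, factor)), label)
        opt.zero_grad(); loss.backward(); opt.step()
```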
arXiv Detail & Related papers (2024-01-16T14:59:37Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
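The audio-guided aspect can be pictured as cross-modal attention in which audio tokens act as queries over visual tokens; below is a generic PyTorch sketch of one such block, not the paper's exact CMFE architecture.

```python
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Generic audio-guided cross-modal attention block (sketch): audio
    tokens query the visual token sequence, and the fused result is added
    back to the audio stream with a residual connection."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)

fusion = AudioGuidedFusion(256)
out = fusion(torch.randn(2, 120, 256), torch.randn(2, 30, 256))
```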
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
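In rough outline, a frozen image encoder supplies a conditioning vector and the diffusion model's denoiser consumes it alongside the noisy target. The toy sketch below substitutes a random projection for CLIP and a tiny MLP for the denoiser, so every component in it is a stand-in rather than CLIPSonic's actual model.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained CLIP image encoder (frozen in the real system);
# here just a fixed random projection so the sketch runs end to end.
clip_image_encoder = nn.Linear(3 * 224 * 224, 512).requires_grad_(False)

class ConditionalDenoiser(nn.Module):
    """Toy denoiser conditioned on an image embedding: the conditioning
    vector is projected and added to the noisy-spectrogram features."""
    def __init__(self, spec_dim=128, cond_dim=512, hidden=256):
        super().__init__()
        self.inp = nn.Linear(spec_dim, hidden)
        self.cond = nn.Linear(cond_dim, hidden)
        self.out = nn.Linear(hidden, spec_dim)

    def forward(self, noisy_spec, image_emb):
        h = self.inp(noisy_spec) + self.cond(image_emb).unsqueeze(1)
        return self.out(torch.relu(h))       # predicted noise

frame = torch.randn(2, 3 * 224 * 224)        # a video frame, flattened
noise_pred = ConditionalDenoiser()(torch.randn(2, 100, 128),
                                   clip_image_encoder(frame))
```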
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
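The two-stage shape of such a system can be sketched as a transformer that regresses a mel-spectrogram from audio-visual features, followed by a vocoder that maps mel frames to waveform samples. Everything below is a toy stand-in with assumed dimensions, not LA-VocE's architecture.

```python
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stage 1 (sketch): transformer maps noisy audio-visual features
    to a clean mel-spectrogram."""
    def __init__(self, in_dim=512, mel_dim=80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(256, nhead=4, batch_first=True)
        self.proj = nn.Linear(in_dim, 256)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(256, mel_dim)

    def forward(self, av_feats):                 # (batch, frames, in_dim)
        return self.head(self.enc(self.proj(av_feats)))

# Stage 2 (sketch): toy "vocoder" emitting 160 waveform samples per frame.
vocoder = nn.Sequential(nn.Linear(80, 256), nn.Tanh(), nn.Linear(256, 160))

mel = MelPredictor()(torch.randn(2, 50, 512))     # (2, 50, 80)
wave = vocoder(mel).reshape(2, -1)                # (2, 8000)
```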
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
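A minimal sketch of building such a pseudo language, assuming k-means quantization of frame-level features with consecutive duplicates merged; Wav2Seq additionally compresses the token stream (e.g., with BPE), which is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_tokens(features: np.ndarray, kmeans: KMeans) -> list:
    """Map frame-level speech features to pseudo tokens: quantize with
    k-means, then merge consecutive repeats into single tokens."""
    ids = kmeans.predict(features)
    return [int(t) for i, t in enumerate(ids) if i == 0 or t != ids[i - 1]]

# Fit a small codebook on (frames, feat_dim) features from some encoder;
# random features here just so the sketch runs.
feats = np.random.randn(1000, 39).astype(np.float32)
km = KMeans(n_clusters=25, n_init=10, random_state=0).fit(feats)
print(pseudo_tokens(feats[:100], km))   # e.g. [7, 3, 18, 3, ...]
```

The resulting token sequences serve as targets for a self-supervised pseudo speech recognition task, training both the encoder and the decoder before any labeled data is seen.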
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy [26.975596225131824]
We propose a joint training framework, using the acoustic features and raw images directly as inputs for the AVSC task.
Specifically, we retrieve the bottom layers of pre-trained image models as visual encoder, and jointly optimize the scene classifier and 1D-CNN based acoustic encoder during training.
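A sketch of that joint setup, assuming a ResNet-18 truncated after its second stage as the "bottom layers" and a small 1D-CNN acoustic branch; the layer choices and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Visual branch: bottom layers of a (normally pretrained) image model.
# weights=None so the sketch runs offline; the paper uses pretrained weights.
backbone = resnet18(weights=None)
visual_enc = nn.Sequential(*list(backbone.children())[:6])  # through layer2

acoustic_enc = nn.Sequential(        # 1D-CNN over (batch, mels, frames)
    nn.Conv1d(64, 128, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten())

classifier = nn.Linear(128 * 28 * 28 + 128, 10)  # joint scene classifier

img, spec = torch.randn(2, 3, 224, 224), torch.randn(2, 64, 500)
v = visual_enc(img).flatten(1)                   # (2, 128*28*28)
logits = classifier(torch.cat([v, acoustic_enc(spec)], dim=1))
```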
arXiv Detail & Related papers (2022-04-25T03:37:02Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, this is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
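The joint objective can be sketched as masking a subset of patches and training the model both to reconstruct them (generative) and to pick the true patch out of the masked set (discriminative). The code below is a simplified illustration of that combination, not SSAST's exact prediction heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mspm_losses(encoder, patches, mask_ratio=0.4):
    """Joint discriminative + generative masked patch modeling (sketch).
    patches: (batch, n, dim) spectrogram patch embeddings."""
    b, n, d = patches.shape
    idx = torch.randperm(n)[: int(n * mask_ratio)]
    corrupted = patches.clone()
    corrupted[:, idx] = 0.0                      # replace with a mask value
    pred = encoder(corrupted)                    # predict all patches
    target, guess = patches[:, idx], pred[:, idx]
    gen = F.mse_loss(guess, target)              # generative: reconstruct
    # discriminative: InfoNCE -- match each prediction to its true patch
    sim = guess.reshape(-1, d) @ target.reshape(-1, d).T
    disc = F.cross_entropy(sim, torch.arange(sim.shape[0]))
    return gen + disc

enc = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
loss = mspm_losses(enc, torch.randn(4, 100, 256))
```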
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
- Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
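The core trick is to drop a random subset of spectrogram patch tokens during training, which both regularizes the model and shortens the attention sequence. A minimal sketch of that idea follows; the actual paper also drops structured groups of patches (whole time frames or frequency bins), omitted here.

```python
import torch

def patchout(tokens: torch.Tensor, drop_ratio: float, training: bool = True):
    """Randomly drop a fraction of patch tokens during training (sketch).
    tokens: (batch, num_patches, dim)."""
    if not training or drop_ratio == 0.0:
        return tokens
    b, n, d = tokens.shape
    keep = torch.randperm(n)[: int(n * (1 - drop_ratio))]
    return tokens[:, keep.sort().values]         # keep survivors in order

x = torch.randn(2, 1212, 768)
print(patchout(x, 0.5).shape)                    # torch.Size([2, 606, 768])
```

Because self-attention cost grows quadratically with sequence length, halving the patch count roughly quarters the attention compute, which is what makes single-GPU training feasible.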
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
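The multimodal contrastive losses pair embeddings of the same clip across modalities; below is a sketch of a symmetric InfoNCE objective of that general kind, with an arbitrarily chosen temperature, not VATT's exact loss schedule.

```python
import torch
import torch.nn.functional as F

def contrastive_nce(a: torch.Tensor, b: torch.Tensor, temp: float = 0.07):
    """Symmetric InfoNCE between two batches of modality embeddings that
    were projected into a shared space; matched pairs share a row index."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temp                      # (batch, batch) similarities
    labels = torch.arange(a.shape[0])            # matched pairs on diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

video_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive_nce(video_emb, audio_emb)
```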
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
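One common realization of relative positional encoding adds a learned bias, indexed by query-key distance, to the attention logits. The sketch below shows that form; the paper adapts the Transformer-XL style, which differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosAttention(nn.Module):
    """Single-head attention with a learned relative-position bias added
    to the logits (a sketch of one relative-encoding variant)."""
    def __init__(self, dim: int, max_dist: int = 128):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.bias = nn.Parameter(torch.zeros(2 * max_dist - 1))
        self.max_dist = max_dist

    def forward(self, x: torch.Tensor):          # (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pos = torch.arange(x.shape[1])
        rel = (pos[None, :] - pos[:, None]).clamp(
            -self.max_dist + 1, self.max_dist - 1) + self.max_dist - 1
        att = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5 + self.bias[rel]
        return F.softmax(att, dim=-1) @ v

out = RelPosAttention(64)(torch.randn(2, 100, 64))
```

Because the bias depends only on relative distance, the same parameters apply at any absolute position, which suits the variable-length inputs typical of speech.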
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.