Play It Back: Iterative Attention for Audio Recognition
- URL: http://arxiv.org/abs/2210.11328v1
- Date: Thu, 20 Oct 2022 15:03:22 GMT
- Title: Play It Back: Iterative Attention for Audio Recognition
- Authors: Alexandros Stergiou and Dima Damen
- Abstract summary: A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that, through selective repetition, attends to the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
- Score: 104.628661890361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key function of auditory cognition is the association of characteristic
sounds with their corresponding semantics over time. Humans attempting to
discriminate between fine-grained audio categories often replay the same
discriminative sounds to increase their prediction confidence. We propose an
end-to-end attention-based architecture that, through selective repetition,
attends to the most discriminative sounds across the audio sequence. Our
model initially uses the full audio sequence and iteratively refines the
temporal segments it replays based on slot attention. At each playback, the
selected segments are replayed with a smaller hop length, which yields
higher-resolution features within these segments. We show that our method
consistently achieves state-of-the-art performance across three
audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
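The replay mechanism lends itself to a short illustration: run a coarse pass over the whole clip, score its segments, then recompute features for the top-scoring segments at a smaller hop length. Below is a minimal, hypothetical sketch using librosa and NumPy; the mean-amplitude scores stand in for the paper's slot-attention weights, and all names (replay_segments, HOP_COARSE, HOP_FINE) are illustrative, not the authors' code.

```python
import numpy as np
import librosa

SR = 16000          # sample rate (assumption; the paper's setting may differ)
HOP_COARSE = 512    # hop length for the first, full-sequence pass
HOP_FINE = 128      # smaller hop used when a segment is "played back"

def mel(y, hop):
    # log-mel features; a smaller hop gives more frames, i.e. finer temporal resolution
    m = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=1024, hop_length=hop, n_mels=64)
    return librosa.power_to_db(m)

def replay_segments(y, seg_scores, seg_len, top_k=2):
    """Re-extract the top-k scoring segments at a finer hop length.
    seg_scores stands in for per-segment slot-attention importance."""
    top = np.argsort(seg_scores)[::-1][:top_k]
    return [mel(y[i * seg_len:(i + 1) * seg_len], HOP_FINE) for i in sorted(top)]

# toy waveform: 4 seconds of noise with a brief 1 kHz "event" in the middle
y = 0.01 * np.random.randn(4 * SR).astype(np.float32)
t = np.arange(SR) / SR
y[2 * SR:3 * SR] += 0.5 * np.sin(2 * np.pi * 1000 * t).astype(np.float32)

coarse = mel(y, HOP_COARSE)          # first pass over the full sequence
seg_len = SR                         # 1-second segments, for illustration
n_seg = len(y) // seg_len
scores = np.array([np.abs(y[i*seg_len:(i+1)*seg_len]).mean() for i in range(n_seg)])
fine = replay_segments(y, scores, seg_len)
print(coarse.shape, [f.shape for f in fine])  # replays have ~4x more frames per second
```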
Related papers
- Multi-label Zero-Shot Audio Classification with Temporal Attention [8.518434546898524]
The present study introduces a method to perform multi-label zero-shot audio classification.
We adapt temporal attention to assign importance weights to different audio segments based on their acoustic and semantic compatibility.
Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
arXiv Detail & Related papers (2024-08-31T09:49:41Z) - STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Audio-Visual Synchronisation in the wild [149.84890978170174]
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z) - You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and
Sound Event Detection [0.0]
We present a novel approach called You Only Hear Once (YOHO).
We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art Convolutional Recurrent Neural Network.
arXiv Detail & Related papers (2021-09-01T12:50:16Z) - Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation [23.38624506211003]
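Casting boundary detection as regression means each output time bin predicts, per class, a presence flag plus normalized start/stop offsets instead of a frame-wise label. The sketch below is one schematic reading of that YOLO-like encoding; yoho_targets and its exact fields are assumptions, not the authors' code.

```python
import numpy as np

def yoho_targets(events, n_classes, duration, n_bins):
    """Encode (class, start, stop) events as per-bin regression targets:
    for each time bin and class -> [presence, norm. start, norm. stop]."""
    bin_len = duration / n_bins
    target = np.zeros((n_bins, n_classes, 3), dtype=np.float32)
    for cls, start, stop in events:
        first = int(start // bin_len)
        last = min(int(np.ceil(stop / bin_len)) - 1, n_bins - 1)
        for b in range(first, last + 1):
            b0 = b * bin_len
            target[b, cls, 0] = 1.0                                # event present
            target[b, cls, 1] = max(start - b0, 0.0) / bin_len     # start offset
            target[b, cls, 2] = min(stop - b0, bin_len) / bin_len  # stop offset
    return target

# one 0.7-2.0 s event (class 0) in a 4 s clip, 8 output bins of 0.5 s each
t = yoho_targets([(0, 0.7, 2.0)], n_classes=2, duration=4.0, n_bins=8)
print(t[:, 0, :])  # bins 1-3 are active; boundaries are regressed, not classified per frame
```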
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z) - Neural Audio Fingerprint for High-specific Audio Retrieval based on
Contrastive Learning [14.60531205031547]
We present a contrastive learning framework derived from the segment-level search objective.
In the segment-level search task, where conventional audio fingerprinting systems often fail, our system shows promising results while using 10x smaller storage.
arXiv Detail & Related papers (2020-10-22T17:44:40Z)