Audiovisual transfer learning for audio tagging and sound event
detection
- URL: http://arxiv.org/abs/2106.05408v1
- Date: Wed, 9 Jun 2021 21:55:05 GMT
- Title: Audiovisual transfer learning for audio tagging and sound event
detection
- Authors: Wim Boes, Hugo Van hamme
- Abstract summary: We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
- Score: 21.574781022415372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the merit of transfer learning for two sound recognition problems,
i.e., audio tagging and sound event detection. Employing feature fusion, we
adapt a baseline system utilizing only spectral acoustic inputs to also make
use of pretrained auditory and visual features, extracted from networks built
for different tasks and trained with external data. We perform experiments with
these modified models on an audiovisual multi-label data set, of which the
training partition contains a large number of unlabeled samples and a smaller
amount of clips with weak annotations, indicating the clip-level presence of 10
sound categories without specifying the temporal boundaries of the active
auditory events. For clip-based audio tagging, this transfer learning method
grants marked improvements. Addition of the visual modality on top of audio
also proves to be advantageous in this context. When it comes to generating
transcriptions of audio recordings, the benefit of pretrained features depends
on the requested temporal resolution: for coarse-grained sound event detection,
their utility remains notable. But when more fine-grained predictions are
required, performance gains are strongly reduced due to a mismatch between the
problem at hand and the goals of the models from which the pretrained vectors
were obtained.
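To illustrate the feature-fusion setup described in the abstract, the sketch below concatenates frame-level pretrained auditory and visual embeddings with a spectral (CRNN-style) branch before classification. This is a minimal sketch under assumed dimensions, module names, and pooling choices; it is not the authors' exact architecture.
```python
# Minimal PyTorch sketch of feature fusion for weakly labeled audio tagging /
# sound event detection. Embedding sizes, layer names, and the pooling scheme
# are illustrative assumptions, not the system described in the paper.
import torch
import torch.nn as nn

class FusionTagger(nn.Module):
    def __init__(self, n_mels=64, audio_emb_dim=128, visual_emb_dim=512,
                 hidden_dim=128, n_classes=10):
        super().__init__()
        # CRNN-style encoder for the spectral (log-mel) input.
        self.spec_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),  # pool away the frequency axis
        )
        self.rnn = nn.GRU(32, hidden_dim, batch_first=True, bidirectional=True)
        # Fusion: concatenate recurrent spectral features with the
        # (time-aligned) pretrained auditory and visual embeddings.
        fused_dim = 2 * hidden_dim + audio_emb_dim + visual_emb_dim
        self.frame_classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, log_mel, audio_emb, visual_emb):
        # log_mel:    (batch, 1, frames, n_mels)
        # audio_emb:  (batch, frames, audio_emb_dim)   pretrained auditory features
        # visual_emb: (batch, frames, visual_emb_dim)  pretrained visual features
        x = self.spec_encoder(log_mel).squeeze(-1).transpose(1, 2)  # (batch, frames, 32)
        x, _ = self.rnn(x)                                          # (batch, frames, 2*hidden)
        fused = torch.cat([x, audio_emb, visual_emb], dim=-1)
        frame_probs = torch.sigmoid(self.frame_classifier(fused))   # sound event detection
        clip_probs = frame_probs.mean(dim=1)                        # clip-level audio tagging
        return frame_probs, clip_probs
```
Averaging frame-level probabilities into a clip-level score is one simple way to train from weak, clip-level labels only; attention-based pooling is a common alternative in weakly supervised sound event detection.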
Related papers
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a major issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z) - Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio
Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - An investigation on selecting audio pre-trained models for audio
captioning [5.837881923712393]
Pre-trained models are widely used in audio captioning due to the high complexity of the task.
Unless a comprehensive system is re-trained, it is hard to determine how well a pre-trained model contributes to the audio captioning system.
In this paper, a series of pre-trained models are investigated for the correlation between extracted audio features and the performance of audio captioning.
arXiv Detail & Related papers (2022-08-12T06:14:20Z) - Improving Polyphonic Sound Event Detection on Multichannel Recordings
with the Sørensen-Dice Coefficient Loss and Transfer Learning [15.088901748728391]
Polyphonic sound event detection systems trained with the Dice loss consistently outperformed those trained with cross-entropy loss (a generic sketch of this loss appears after this list).
We achieved further performance gains via the use of transfer learning and an appropriate combination of different data augmentation techniques.
arXiv Detail & Related papers (2021-07-22T06:14:23Z) - Cross-Referencing Self-Training Network for Sound Event Detection in
Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training.
The results of these methods on both "validation" and "public evaluation" sets of the DESED database show significant improvement compared to the state-of-the-art systems in semi-supervised learning.
arXiv Detail & Related papers (2021-05-27T18:46:59Z) - Unsupervised Discriminative Learning of Sounds for Audio Event
Classification [43.81789898864507]
Neural network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.
We show a fast and effective alternative that pre-trains the model unsupervised, only on audio data and yet delivers on-par performance with ImageNet pre-training.
arXiv Detail & Related papers (2021-05-19T17:42:03Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio
Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs (a generic sketch of this construction appears after this list).
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
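For the Sørensen-Dice entry above, a soft Dice loss for multi-label, frame-level sound event predictions can be written as in the sketch below. This is a generic formulation of the Dice coefficient loss, not necessarily the exact variant used in that paper.
```python
# Generic soft Dice (Sørensen-Dice coefficient) loss for multi-label,
# frame-level sound event detection; a common formulation offered as a sketch.
import torch

def soft_dice_loss(probs, targets, eps=1e-7):
    # probs, targets: (batch, frames, classes); probs in [0, 1], targets in {0, 1}
    intersection = (probs * targets).sum(dim=(0, 1))
    denominator = probs.sum(dim=(0, 1)) + targets.sum(dim=(0, 1))
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_per_class.mean()
```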
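For the correlation-based feature representation entry, one plausible reading is to describe a recording by the pairwise Pearson correlations between its hand-crafted feature trajectories, as sketched below. This is an assumed, generic construction rather than the recipe from that paper.
```python
# Rough sketch of a correlation-based feature representation: pairwise Pearson
# correlations between hand-crafted feature trajectories form a compact descriptor.
import numpy as np

def correlation_representation(features):
    # features: (n_frames, n_features) matrix of hand-crafted features over time
    corr = np.corrcoef(features, rowvar=False)      # (n_features, n_features)
    upper = corr[np.triu_indices_from(corr, k=1)]   # keep each feature pair once
    return upper                                    # vector of pairwise correlations
```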