Improving Deep-learning-based Semi-supervised Audio Tagging with Mixup
- URL: http://arxiv.org/abs/2102.08183v1
- Date: Tue, 16 Feb 2021 14:33:05 GMT
- Title: Improving Deep-learning-based Semi-supervised Audio Tagging with Mixup
- Authors: Léo Cances, Etienne Labbé, Thomas Pellegrini
- Abstract summary: Semi-supervised learning (SSL) methods have been shown to provide state-of-the-art results on image datasets by exploiting unlabeled data.
In this article, we adapted four recent SSL methods to the task of audio tagging.
- Score: 2.707154152696381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, semi-supervised learning (SSL) methods, in the framework of deep
learning (DL), have been shown to provide state-of-the-art results on image
datasets by exploiting unlabeled data. Mostly tested on object recognition
tasks in images, these algorithms are rarely compared when applied to audio
tasks. In this article, we adapted four recent SSL methods to the task of
audio tagging. The first two methods, namely Deep Co-Training (DCT) and Mean
Teacher (MT), involve two collaborative neural networks. The two other
algorithms, called MixMatch (MM) and FixMatch (FM), are single-model methods
that rely primarily on data augmentation strategies. Using the Wide ResNet
28-2 architecture in all our experiments, with 10% of the data labeled and the
remaining 90% treated as unlabeled, we first compare the four methods'
accuracy on three standard
benchmark audio event datasets: Environmental Sound Classification (ESC-10),
UrbanSound8K (UBS8K), and Google Speech Commands (GSC). MM and FM outperformed
MT and DCT significantly, MM being the best method in most experiments. On
UBS8K and GSC, in particular, MM achieved 18.02% and 3.25% error rates (ER),
outperforming models trained with 100% of the available labeled data, which
reached 23.29% and 4.94% ER, respectively. Second, we explored the benefits of
using the mixup augmentation in the four algorithms. In almost all cases, mixup
brought significant gains. For instance, on GSC, FM reached 4.44% and 3.31% ER
without and with mixup, respectively.
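Since mixup is the central ingredient studied in this paper, a minimal sketch of the augmentation may help. This is the standard formulation from Zhang et al. (2018), not the authors' code; the PyTorch tensor shapes and the Beta parameter alpha below are illustrative assumptions.

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    """Mixup: blend random pairs of examples and their labels.

    x: batch of inputs, e.g. log-mel spectrograms, shape (B, ...)
    y: one-hot label matrix, shape (B, num_classes)
    alpha: Beta-distribution parameter (0.4 is an illustrative choice)
    """
    # Draw a single mixing coefficient lambda ~ Beta(alpha, alpha)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))          # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]   # convex combination of inputs
    y_mix = lam * y + (1.0 - lam) * y[perm]   # same combination of targets
    return x_mix, y_mix
```

Training then proceeds with the usual loss on (x_mix, y_mix); in SSL methods such as MixMatch and FixMatch, y would include the guessed or pseudo-labels produced for the unlabeled examples.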
Related papers
- Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings [7.42741711946564]
We introduce the use of speech embeddings extracted from deep learning models pre-trained on large audio datasets for different tasks.
Compared to standard stuttering detection (SD) systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines.
arXiv Detail & Related papers (2023-06-01T14:00:47Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities enables better speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Adaptive Few-Shot Learning Algorithm for Rare Sound Event Detection [24.385226516231004]
We propose a novel task-adaptive module that is easy to plug into any metric-based few-shot learning framework.
Our module improves the performance considerably on two datasets over baseline methods.
arXiv Detail & Related papers (2022-05-24T03:13:12Z)
- Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin [92.76372026435858]
We learn an Adaptive Confidence Margin (Ada-CM) to fully leverage all unlabeled data for semi-supervised deep facial expression recognition.
All unlabeled samples are partitioned into two subsets by comparing their confidence scores with the adaptively learned confidence margin.
Our method achieves state-of-the-art performance, even surpassing fully-supervised baselines despite being trained in a semi-supervised manner.
arXiv Detail & Related papers (2022-03-23T11:43:29Z)
- Robust Segmentation Models using an Uncertainty Slice Sampling Based Annotation Workflow [5.051373749267151]
We propose an uncertainty slice sampling (USS) strategy for semantic segmentation of 3D medical volumes.
We demonstrate the efficiency of USS on a liver segmentation task using multi-site data.
arXiv Detail & Related papers (2021-09-30T06:56:11Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking-based augmentation technique on log-mel spectrograms can significantly improve recognition performance.
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- AlphaMatch: Improving Consistency for Semi-supervised Learning with Alpha-divergence [44.88886269629515]
Semi-supervised learning (SSL) is a key approach toward more data-efficient machine learning, jointly leveraging both labeled and unlabeled data.
We propose AlphaMatch, an efficient SSL method that leverages data augmentation by enforcing label consistency between data points and the augmented data derived from them.
arXiv Detail & Related papers (2020-11-23T22:43:45Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets (Adult, and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective improves micro-F1 scores by up to 8% absolute compared to models trained with the cross-entropy loss function (see the sketch of a soft F-measure loss after this list).
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- iTAML: An Incremental Task-Agnostic Meta-learning Approach [123.10294801296926]
Humans can continuously learn new knowledge as their experience grows.
In deep neural networks, however, previously learned knowledge can quickly fade when the network is trained on a new task.
We introduce a novel meta-learning approach that seeks to maintain an equilibrium between all encountered tasks.
arXiv Detail & Related papers (2020-03-25T21:42:48Z)
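Following up on the Deep F-measure Maximization entry above, here is a minimal sketch of a differentiable (soft) F-measure loss. It follows the common soft-F1 construction rather than that paper's exact formulation; all names and shapes are illustrative assumptions.

```python
import torch

def soft_f1_loss(probs: torch.Tensor, targets: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """Differentiable surrogate for 1 - F1.

    probs:   predicted probabilities in [0, 1], shape (B, num_classes)
    targets: binary ground-truth labels, shape (B, num_classes)

    Replacing the hard counts (TP, FP, FN) with their expectations under
    the predicted probabilities makes the F-measure differentiable, so it
    can be trained with standard backpropagation.
    """
    tp = (probs * targets).sum(dim=0)           # expected true positives per class
    fp = (probs * (1.0 - targets)).sum(dim=0)   # expected false positives per class
    fn = ((1.0 - probs) * targets).sum(dim=0)   # expected false negatives per class
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)  # soft F1 per class
    return 1.0 - f1.mean()                      # minimize 1 - mean soft F1
```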