Generalizing AUC Optimization to Multiclass Classification for Audio
Segmentation With Limited Training Data
- URL: http://arxiv.org/abs/2110.14425v1
- Date: Wed, 27 Oct 2021 13:36:04 GMT
- Title: Generalizing AUC Optimization to Multiclass Classification for Audio
Segmentation With Limited Training Data
- Authors: Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel,
Eduardo Lleida
- Abstract summary: Area under the ROC curve (AUC) optimisation techniques have recently demonstrated their capabilities in different audio and speech related tasks.
We introduce an extension to the AUC optimisation framework so that it can be easily applied to an arbitrary number of classes.
- Score: 20.943224434364517
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Area under the ROC curve (AUC) optimisation techniques developed for neural
networks have recently demonstrated their capabilities in different audio and
speech related tasks. However, due to its intrinsic nature, AUC optimisation
has focused only on binary tasks so far. In this paper, we introduce an
extension to the AUC optimisation framework so that it can be easily applied to
an arbitrary number of classes, aiming to overcome the issues derived from
training data limitations in deep learning solutions. Building upon the
multiclass definitions of the AUC metric found in the literature, we define two
new training objectives using a one-versus-one and a one-versus-rest approach.
In order to demonstrate its potential, we apply them in an audio segmentation
task with limited training data that aims to differentiate 3 classes:
foreground music, background music and no music. Experimental results show that
our proposal can improve the performance of audio segmentation systems
significantly compared to traditional training criteria such as cross entropy.
Related papers
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z) - Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation [19.20874993309959]
vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks.
We propose a baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP)
Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature.
arXiv Detail & Related papers (2024-04-12T01:08:04Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Multimodal Parameter-Efficient Few-Shot Class Incremental Learning [1.9220716793379256]
Few-Shot Class Incremental Learning (FSCIL) is a challenging continual learning task, where limited training examples are available during several learning sessions.
To succeed in this task, it is necessary to avoid over-fitting new classes caused by biased distributions in the few-shot training sets.
CPE-CLIP significantly improves FSCIL performance compared to state-of-the-art proposals while also drastically reducing the number of learnable parameters and training costs.
arXiv Detail & Related papers (2023-03-08T17:34:15Z) - Multi-source Domain Adaptation for Text-independent Forensic Speaker
Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely-used technique to improve a well-performing model.
Previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z) - Towards All-in-one Pre-training via Maximizing Multi-modal Mutual
Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training)
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
arXiv Detail & Related papers (2022-11-17T18:59:49Z) - Effective Adaptation in Multi-Task Co-Training for Unified Autonomous
Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised
Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework reports significant performance compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z) - Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
arXiv Detail & Related papers (2021-11-18T11:47:37Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic
Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.