Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion,
Age, and Origin from Vocal Bursts
- URL: http://arxiv.org/abs/2206.12469v1
- Date: Fri, 24 Jun 2022 18:57:41 GMT
- Title: Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion,
Age, and Origin from Vocal Bursts
- Authors: Atijit Anuchitanukul and Lucia Specia
- Abstract summary: Burst2Vec uses pre-trained speech representations to capture acoustic information from raw waveforms.
Our models achieve a relative 30% performance gain over baselines using pre-extracted features.
- Score: 49.31604138034298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Burst2Vec, our multi-task learning approach to predict emotion,
age, and origin (i.e., native country/language) from vocal bursts. Burst2Vec
adversarial training. Our models achieve a relative 30 % performance gain over
baselines using pre-extracted features and score the highest amongst all
participants in the ICML ExVo 2022 Multi-Task Challenge.
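The abstract combines two ingredients: pre-trained speech features feeding several task heads, and adversarial training for debiasing. A common way to realise the adversarial part is a gradient-reversal layer between the shared features and an adversary head. The PyTorch sketch below illustrates that combination under our own assumptions; the feature dimension, head sizes, nuisance attribute, and all names are illustrative stand-ins, not Burst2Vec's actual design.

```python
# Minimal sketch: multi-task heads over pooled pretrained features, with an
# adversary behind a gradient-reversal layer to strip a nuisance attribute.
# All dimensions and names are assumptions, not taken from the Burst2Vec paper.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient going backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiTaskDebiased(nn.Module):
    def __init__(self, feat_dim=768, n_emotions=10, n_countries=4,
                 n_nuisance=8, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        self.emotion_head = nn.Linear(feat_dim, n_emotions)   # emotion prediction
        self.age_head = nn.Linear(feat_dim, 1)                # scalar age regression
        self.origin_head = nn.Linear(feat_dim, n_countries)   # native country/language
        # The adversary tries to recover a nuisance attribute from the shared
        # features; reversing its gradient pushes that information out of them.
        self.adversary = nn.Linear(feat_dim, n_nuisance)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        pooled = feats.mean(dim=1)             # mean-pool over time
        adv_in = GradReverse.apply(pooled, self.lambd)
        return (self.emotion_head(pooled), self.age_head(pooled),
                self.origin_head(pooled), self.adversary(adv_in))

feats = torch.randn(8, 50, 768)  # stand-in for wav2vec 2.0-style frame features
emotion, age, origin, nuisance = MultiTaskDebiased()(feats)
```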
Related papers
- A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition [0.0]
Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings…
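As a rough illustration of the recipe this paper shares with the others here, the sketch below pools frame-level outputs of a pretrained model into one embedding per clip and fits a light classifier on top. The GRU stands in for a real PTM, and all sizes and label counts are invented.

```python
# Embeddings-as-features recipe: freeze a stand-in "pretrained" encoder,
# pool its frame outputs into clip embeddings, train only a small classifier.
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=40, hidden_size=256, batch_first=True)  # stand-in PTM
clips = torch.randn(32, 100, 40)            # 32 clips of 100 frames x 40 features
with torch.no_grad():
    frames, _ = encoder(clips)
    embeddings = frames.mean(dim=1)         # (32, 256) clip-level embeddings

labels = torch.randint(0, 6, (32,))         # e.g. six emotion classes (toy labels)
clf = nn.Linear(256, 6)
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(10):                         # downstream training only
    opt.zero_grad()
    loss = loss_fn(clf(embeddings), labels)
    loss.backward()
    opt.step()
```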
arXiv Detail & Related papers (2023-04-22T19:56:35Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
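A hedged sketch of the two-stage idea: first pre-finetune a shared encoder on a difficult auxiliary task, then adapt only a small head on a handful of labelled examples. The tasks, shapes, and freezing choice here are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 128))

# Stage 1: pre-finetune the encoder on a hard auxiliary task (toy data).
aux_head = nn.Linear(128, 20)
opt = torch.optim.Adam(list(encoder.parameters()) + list(aux_head.parameters()))
x_aux, y_aux = torch.randn(256, 40), torch.randint(0, 20, (256,))
for _ in range(20):
    opt.zero_grad()
    F.cross_entropy(aux_head(encoder(x_aux)), y_aux).backward()
    opt.step()

# Stage 2: few-shot emotion classification; the encoder is frozen so the
# distilled knowledge is preserved and only the small head adapts.
x_few, y_few = torch.randn(10, 40), torch.randint(0, 5, (10,))  # ten labelled shots
head = nn.Linear(128, 5)
opt2 = torch.optim.Adam(head.parameters())
for _ in range(50):
    opt2.zero_grad()
    F.cross_entropy(head(encoder(x_few).detach()), y_few).backward()
    opt2.step()
```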
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
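A minimal sketch of the retrieval step, assuming the usual shared-embedding-space design: embed images and a spoken query, L2-normalise, and rank candidates by cosine similarity. The linear projections below stand in for the paper's CLIP image tower and HuBERT-based speech branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_enc = nn.Linear(512, 256)    # stand-in for a CLIP image-embedding projection
speech_enc = nn.Linear(768, 256)   # stand-in for a HuBERT-based speech projection

images = F.normalize(image_enc(torch.randn(100, 512)), dim=-1)  # 100 candidate images
query = F.normalize(speech_enc(torch.randn(1, 768)), dim=-1)    # one spoken caption

scores = query @ images.T          # cosine similarities, shape (1, 100)
top5 = scores.topk(5).indices      # indices of the five best-matching images
```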
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- An Efficient Multitask Learning Architecture for Affective Vocal Burst Analysis [1.2951621755732543]
Current approaches to address affective vocal burst analysis are mostly based on wav2vec2 or HuBERT features.
In this paper, we investigate the use of the wav2vec successor data2vec in combination with a multitask learning pipeline to tackle different analysis problems at once.
To assess the performance of our efficient multitask learning architecture, we participate in the 2022 ACII Affective Vocal Burst Challenge.
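A bare-bones sketch of such a multitask pipeline: one shared trunk (standing in for data2vec features plus a projection) with one head per analysis problem, and the per-task losses summed into a single objective. Head sizes and task names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Linear(768, 256)            # stand-in for data2vec features + projection
heads = nn.ModuleDict({
    "emotion": nn.Linear(256, 10),     # classification head
    "age":     nn.Linear(256, 1),      # regression head
    "country": nn.Linear(256, 4),      # classification head
})

x = torch.randn(16, 768)               # pooled utterance features (toy)
y_emotion = torch.randint(0, 10, (16,))
y_age = torch.rand(16, 1)
y_country = torch.randint(0, 4, (16,))

h = torch.relu(trunk(x))
loss = (F.cross_entropy(heads["emotion"](h), y_emotion)
        + F.mse_loss(heads["age"](h), y_age)
        + F.cross_entropy(heads["country"](h), y_country))
loss.backward()                        # trunk and all heads learn at once
```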
arXiv Detail & Related papers (2022-09-28T08:32:08Z)
- Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers [11.682025726705122]
This report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask).
We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts.
Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics.
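A small sketch of the spectrogram-as-image recipe under our own assumptions: convert a waveform to a log-mel-spectrogram with torchaudio and feed it to a stock torchvision ResNet, repeating the single channel to satisfy the 3-channel input.

```python
import torch
import torchaudio
from torchvision.models import resnet18

wave = torch.randn(1, 16000)  # stand-in for a one-second vocal burst at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(wave)
mel = torchaudio.transforms.AmplitudeToDB()(mel)     # log-mel, shape (1, 64, frames)

model = resnet18(num_classes=10)                     # stock image classifier
# ResNets expect 3-channel images; repeating the mel channel is one cheap adapter.
logits = model(mel.unsqueeze(0).repeat(1, 3, 1, 1))  # (1, 10) task logits
```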
arXiv Detail & Related papers (2022-06-24T21:42:16Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
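A toy, text-only sketch of the prompt-based reformulation: append a prompt ending in a mask token and read the masked-position logits only at the emotion label words. MEmoBERT itself is multimodal and pretrained; the tiny model, vocabulary, and prompt here are invented for illustration.

```python
import torch
import torch.nn as nn

vocab = {"i": 0, "am": 1, "[MASK]": 2, "happy": 3, "sad": 4, "angry": 5}
label_words = ["happy", "sad", "angry"]

emb = nn.Embedding(len(vocab), 64)
encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
lm_head = nn.Linear(64, len(vocab))

prompt = torch.tensor([[vocab["i"], vocab["am"], vocab["[MASK]"]]])  # "i am [MASK]"
hidden = encoder(emb(prompt))                 # (1, 3, 64) contextual states
mask_logits = lm_head(hidden[0, 2])           # logits at the [MASK] position
label_ids = torch.tensor([vocab[w] for w in label_words])
predicted = label_words[mask_logits[label_ids].argmax().item()]
```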
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study [11.825240267691209]
This paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an acoustic event detection (AED) pipeline.
We develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during the AED process.
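A compact sketch of a dual-branch design under our assumptions: one branch ingests the transferred voice embeddings, the other the raw acoustic features, and their concatenation feeds the event classifier. Dimensions and the fusion-by-concatenation choice are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualBranchAED(nn.Module):
    def __init__(self, voice_dim=192, acoustic_dim=64, n_events=10):
        super().__init__()
        self.voice_branch = nn.Sequential(nn.Linear(voice_dim, 128), nn.ReLU())
        self.acoustic_branch = nn.Sequential(nn.Linear(acoustic_dim, 128), nn.ReLU())
        self.classifier = nn.Linear(256, n_events)  # fused features -> event logits

    def forward(self, voice_emb, acoustic_feats):
        fused = torch.cat([self.voice_branch(voice_emb),
                           self.acoustic_branch(acoustic_feats)], dim=-1)
        return self.classifier(fused)

logits = DualBranchAED()(torch.randn(4, 192), torch.randn(4, 64))  # (4, 10)
```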
arXiv Detail & Related papers (2021-10-07T04:03:21Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S outperforms or is on par with state-of-the-art methods on 20 tasks and improves their average accuracy by 1.84%.
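A toy sketch of the reprogramming idea: a trainable input perturbation maps short time series into a frozen acoustic model's input space, and a fixed many-to-one label map reads source-class logits as target classes. The stand-in model and all sizes are assumptions, not V2S's actual setup.

```python
# Reprogramming sketch: only `delta` adapts; the acoustic model stays frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

acoustic_model = nn.Linear(16000, 527)        # frozen stand-in acoustic model
for p in acoustic_model.parameters():
    p.requires_grad = False

delta = nn.Parameter(torch.zeros(16000))      # trainable universal input transform
series = torch.randn(8, 128)                  # eight time series of length 128
padded = F.pad(series, (0, 16000 - 128))      # place the series in the audio frame
logits = acoustic_model(padded + delta)       # reprogrammed forward pass

# Many-to-one label map: average blocks of source-class logits per target class.
target_logits = logits[:, :500].reshape(-1, 2, 250).mean(dim=-1)
loss = F.cross_entropy(target_logits, torch.randint(0, 2, (8,)))
loss.backward()                               # only delta receives gradients
```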
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
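A toy rendering of the contrastive objective wav2vec 2.0 (and hence XLSR) builds on: mask some latent time steps, let a context network predict them, and score the true latent against sampled distractors with an InfoNCE-style loss. Everything here (sizes, masking by zeroing, the single encoder layer) is simplified for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latents = torch.randn(1, 50, 64)              # target latent frames (toy stand-in)
masked = latents.clone()
mask_idx = torch.tensor([5, 17, 30])
masked[0, mask_idx] = 0.0                     # mask three time steps

context_net = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
context = context_net(masked)                 # predict content at masked steps

preds = context[0, mask_idx]                  # (3, 64) predictions
targets = latents[0, mask_idx]                # true latents at masked steps
negatives = latents[0, torch.randint(0, 50, (3, 10))]        # 10 distractors each
cands = torch.cat([targets.unsqueeze(1), negatives], dim=1)  # true one at index 0
sims = F.cosine_similarity(preds.unsqueeze(1), cands, dim=-1) / 0.1  # temperature
loss = F.cross_entropy(sims, torch.zeros(3, dtype=torch.long))  # pick the true latent
```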
arXiv Detail & Related papers (2020-06-24T18:25:05Z)