Multitask vocal burst modeling with ResNets and pre-trained
paralinguistic Conformers
- URL: http://arxiv.org/abs/2206.12494v1
- Date: Fri, 24 Jun 2022 21:42:16 GMT
- Title: Multitask vocal burst modeling with ResNets and pre-trained
paralinguistic Conformers
- Authors: Josh Belanich, Krishna Somandepalli, Brian Eoff, Brendan Jou
- Abstract summary: This report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask).
We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts.
Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics.
- Score: 11.682025726705122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report presents the modeling approaches used in our submission
to the ICML Expressive Vocalizations Workshop & Competition multitask track
(ExVo-MultiTask). We first applied image classification models of various sizes
on mel-spectrogram representations of the vocal bursts, as is standard in sound
event detection literature. Results from these models show an increase of
21.24% over the baseline system with respect to the harmonic mean of the task
metrics, and comprise our team's main submission to the MultiTask track. We
then sought to characterize the headroom in the MultiTask track by applying a
large pre-trained Conformer model that previously achieved state-of-the-art
results on paralinguistic tasks like speech emotion recognition and mask
detection. We additionally investigated the relationship between the sub-tasks
of emotional expression, country of origin, and age prediction, and discovered
that the best performing models are trained as single-task models, questioning
whether the problem truly benefits from a multitask setting.
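As a concrete illustration of the main submission, the sketch below pairs a log-mel-spectrogram front end with a torchvision ResNet backbone and one output head per sub-task (emotion intensities, country of origin, age), plus the harmonic mean used to combine the task metrics. The front-end settings, the ResNet-50 size, and the head dimensions (10 emotions, 4 countries) are illustrative assumptions, not the report's exact configuration.

```python
# A minimal sketch of the mel-spectrogram + image-classifier approach,
# assuming PyTorch, torchaudio, and torchvision. All hyperparameters here
# are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
import torchaudio
import torchvision

class MultiTaskVocalBurstModel(nn.Module):
    def __init__(self, n_emotions: int = 10, n_countries: int = 4):
        super().__init__()
        # Log-mel-spectrogram front end: treats each vocal burst as an image.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=160, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Image-classification backbone; the report tried "various sizes".
        backbone = torchvision.models.resnet50(weights=None)
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)  # 1-channel spectrogram input
        backbone.fc = nn.Identity()                        # expose 2048-d features
        self.backbone = backbone
        # One head per ExVo-MultiTask sub-task.
        self.emotion_head = nn.Linear(2048, n_emotions)    # emotion intensity regression
        self.country_head = nn.Linear(2048, n_countries)   # country-of-origin classification
        self.age_head = nn.Linear(2048, 1)                 # age regression

    def forward(self, waveform: torch.Tensor):               # (batch, samples)
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        h = self.backbone(x)                                 # (batch, 2048)
        return self.emotion_head(h), self.country_head(h), self.age_head(h)

def harmonic_mean(task_scores):
    """Combine the per-task metrics into the single score reported above."""
    return len(task_scores) / sum(1.0 / s for s in task_scores)
```

Since the report found that the best-performing models were trained as single-task models, the same backbone can also be trained with only one of the three heads attached.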
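The second approach characterizes the track's headroom by probing a large pre-trained paralinguistic Conformer. The checkpoint and extraction API for that model are not specified here, so the sketch below assumes embeddings have already been pooled to one vector per vocal burst and trains a lightweight probe on top; the function name and hyperparameters are hypothetical.

```python
# A hedged sketch of probing frozen, pre-trained Conformer embeddings.
# `embeddings` is assumed to be precomputed (one pooled vector per vocal
# burst); the model-specific extraction step is omitted.
import torch
import torch.nn as nn

def train_linear_probe(embeddings, targets, out_dim, loss_fn, epochs=50, lr=1e-3):
    """Train a single-task linear probe on frozen embeddings."""
    probe = nn.Linear(embeddings.shape[1], out_dim)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(embeddings), targets)
        loss.backward()
        opt.step()
    return probe

# Hypothetical usage: one independent probe per sub-task, mirroring the
# single-task setting that the report found to work best.
# emotion_probe = train_linear_probe(embs, emotion_targets, 10, nn.MSELoss())
# country_probe = train_linear_probe(embs, country_labels, 4, nn.CrossEntropyLoss())
# age_probe     = train_linear_probe(embs, ages, 1, nn.L1Loss())
```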
Related papers
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that our multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 out of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring these pretrained models to downstream tasks may encounter task discrepancy, since pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion [86.6191592951269]
Merging models that are fine-tuned from a common, extensively pre-trained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy for constructing a multitask model that performs well across diverse tasks.
We propose the CONtinuous relaxation of discrete (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to tackle the task interference problem without sacrificing performance.
arXiv Detail & Related papers (2023-12-11T07:24:54Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- Meta-training with Demonstration Retrieval for Efficient Few-shot Learning [11.723856248352007]
Large language models show impressive results on few-shot NLP tasks.
These models are memory- and computation-intensive.
We propose meta-training with demonstration retrieval.
arXiv Detail & Related papers (2023-06-30T20:16:22Z)
- Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep learning and pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- Multi-modal Multi-label Facial Action Unit Detection with Transformer [7.30287060715476]
This paper describes our submission to the third Affective Behavior Analysis (ABAW) 2022 competition.
We propose a transformer-based model to detect facial action units (FAUs) in video.
arXiv Detail & Related papers (2022-03-24T18:59:31Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)