Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction
- URL: http://arxiv.org/abs/2206.12568v1
- Date: Sat, 25 Jun 2022 06:09:10 GMT
- Title: Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction
- Authors: Roshan Sharma, Tyler Vuong, Mark Lindsey, Hira Dhamyal, Rita Singh and Bhiksha Raj
- Abstract summary: This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio.
We evaluate the complementarity between the tasks by examining independent task-specific and joint models, and explore the relative strengths of different feature sets.
We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
- Score: 26.860736835176617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents a multitask approach to the simultaneous estimation of
age, country of origin, and emotion given vocal burst audio for the 2022 ICML
Expressive Vocalizations Challenge ExVo-MultiTask track. The approach combines
spectro-temporal modulation and self-supervised features, which are fed to an
encoder-decoder network organized in a multitask paradigm. We evaluate the
complementarity between the tasks by examining independent task-specific and
joint models, and explore the relative strengths of different feature sets. We
also introduce a simple score fusion mechanism that leverages the
complementarity of different feature sets for this task.
We find that robust data preprocessing in conjunction with score fusion over
spectro-temporal receptive field and HuBERT models achieved our best
ExVo-MultiTask test score of 0.412.
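
For illustration, the sketch below shows one way such a system could be wired up: utterance-level STRF and HuBERT features each feed a shared encoder with task-specific decoders for emotion, age, and country, and the two models' outputs are combined by simple score-level fusion. All module names, dimensions, class counts, and the fusion weight here are assumptions made for the sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MultitaskHead(nn.Module):
    """Shared encoder with task-specific decoders for emotion, age, and country."""

    def __init__(self, feat_dim, hidden_dim=256, n_emotions=10, n_countries=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.1)
        )
        self.emotion_head = nn.Linear(hidden_dim, n_emotions)   # emotion intensities
        self.age_head = nn.Linear(hidden_dim, 1)                # age regression
        self.country_head = nn.Linear(hidden_dim, n_countries)  # country classification

    def forward(self, feats):
        # feats: (batch, feat_dim) utterance-level features, e.g. mean-pooled
        # STRF or HuBERT frame representations.
        h = self.encoder(feats)
        return self.emotion_head(h), self.age_head(h), self.country_head(h)


def fuse_scores(out_a, out_b, alpha=0.5):
    """Late (score-level) fusion: convex combination of two models' outputs."""
    return tuple(alpha * a + (1 - alpha) * b for a, b in zip(out_a, out_b))


strf_model = MultitaskHead(feat_dim=128)    # STRF feature dimension (assumed)
hubert_model = MultitaskHead(feat_dim=768)  # HuBERT-base hidden size

strf_feats = torch.randn(8, 128)            # placeholder utterance-level features
hubert_feats = torch.randn(8, 768)
emotion, age, country = fuse_scores(strf_model(strf_feats), hubert_model(hubert_feats))
print(emotion.shape, age.shape, country.shape)  # (8, 10), (8, 1), (8, 4)
```

Score fusion of this kind only requires the per-task outputs of each model, which makes it a cheap way to combine feature sets whose models are trained separately.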
Related papers
- Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.
SSL features conflict with traditional spectral features like FBanks in terms of update directions.
We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
- Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation [3.8570045844185237]
We present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems.
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix.
arXiv Detail & Related papers (2024-08-05T14:34:40Z)
- TSLANet: Rethinking Transformers for Time Series Representation Learning [19.795353886621715]
Time series data is characterized by its intrinsic long and short-range dependencies.
We introduce a novel Time Series Lightweight Network (TSLANet) as a universal convolutional model for diverse time series tasks.
Our experiments demonstrate that TSLANet outperforms state-of-the-art models in various tasks spanning classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2024-04-12T13:41:29Z)
- Toward Fully Self-Supervised Multi-Pitch Estimation [21.000057864087164]
We present a suite of self-supervised learning objectives for multi-pitch estimation.
These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly.
Our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
arXiv Detail & Related papers (2024-02-23T19:12:41Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers [11.682025726705122]
This report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask).
We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts.
Results from these models show an increase of 21.24% over the baseline system with respect to the harmonic mean of the task metrics.
arXiv Detail & Related papers (2022-06-24T21:42:16Z)
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Multi-modal Automated Speech Scoring using Attention Fusion [46.94442359735952]
We propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech.
We employ Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions.
We find combined attention to both lexical and acoustic cues significantly improves the overall performance of the system.
arXiv Detail & Related papers (2020-05-17T07:53:15Z)
- Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting.
In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions.
We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)