Training speech emotion classifier without categorical annotations
- URL: http://arxiv.org/abs/2210.07642v1
- Date: Fri, 14 Oct 2022 08:47:41 GMT
- Title: Training speech emotion classifier without categorical annotations
- Authors: Meysam Shamsi, Marie Tahon
- Abstract summary: The main aim of this study is to investigate the relation between the categorical and dimensional representations of emotion.
The proposed approach contains a regressor model which is trained to predict a vector of continuous values in dimensional representation for given speech audio.
The output of this model can be interpreted as an emotional category using a mapping algorithm.
- Score: 1.5609988622100528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are two paradigms of emotion representation, categorical labeling and
dimensional description in continuous space. Therefore, the emotion recognition
task can be treated as a classification or regression. The main aim of this
study is to investigate the relation between these two representations and
propose a classification pipeline that uses only dimensional annotation. The
proposed approach contains a regressor model which is trained to predict a
vector of continuous values in dimensional representation for given speech
audio. The output of this model can be interpreted as an emotional category
using a mapping algorithm. We investigated the performances of a combination of
three feature extractors, three neural network architectures, and three mapping
algorithms on two different corpora. Our study shows the advantages and
limitations of the classification via regression approach.
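The classification-via-regression idea described in the abstract can be illustrated with a minimal sketch. It assumes a two-dimensional (valence, arousal) space and a nearest-anchor mapping; the abstract does not say which three mapping algorithms were evaluated, so this rule is only one plausible example. The anchor coordinates and the names `CENTROIDS`, `map_to_category`, `classify`, and `dummy_regressor` are hypothetical.

```python
import math

# Hypothetical category anchors in (valence, arousal) space, each in [-1, 1].
# These coordinates are illustrative placeholders, not values from the paper.
CENTROIDS = {
    "neutral": (0.0, 0.0),
    "happy": (0.8, 0.5),
    "angry": (-0.6, 0.8),
    "sad": (-0.7, -0.5),
}


def map_to_category(point, centroids=CENTROIDS):
    """Return the label whose anchor is nearest (Euclidean distance) to `point`."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


def classify(regressor, features, centroids=CENTROIDS):
    """Classification via regression: predict a point, then map it to a label."""
    return map_to_category(regressor(features), centroids)


def dummy_regressor(features):
    """Stand-in for a trained regressor; always predicts the same point."""
    return (-0.5, 0.7)


print(classify(dummy_regressor, features=None))  # prints "angry"
```

In this sketch the regressor never sees a categorical label; categories enter only through the anchors used at mapping time, so swapping the mapping rule does not require retraining.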
Related papers
- Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, making it possible to train deep interpretable models.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z)
- Prototype-based Embedding Network for Scene Graph Generation [105.97836135784794]
Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs.
Due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category.
Prototype-based Embedding Network (PE-Net) models entities/predicates with prototype-aligned compact and distinctive representations.
Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve the ambiguous entity-predicate matching.
arXiv Detail & Related papers (2023-03-13T13:30:59Z)
- Understanding Imbalanced Semantic Segmentation Through Neural Collapse [81.89121711426951]
We show that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes.
We introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure.
Our method ranks 1st and sets a new record on the ScanNet200 test leaderboard.
arXiv Detail & Related papers (2023-01-03T13:51:51Z)
- Speech Emotion Recognition Using Deep Sparse Auto-Encoder Extreme Learning Machine with a New Weighting Scheme and Spectro-Temporal Features Along with Classical Feature Selection and A New Quantum-Inspired Dimension Reduction Method [3.8073142980733]
A system for speech emotion recognition (SER) based on speech signal is proposed.
The system consists of three stages: feature extraction, feature selection, and finally feature classification.
A new weighting method has also been proposed to deal with class imbalance, which is more efficient than existing weighting methods.
arXiv Detail & Related papers (2021-11-13T11:09:38Z)
- Learning Debiased and Disentangled Representations for Semantic Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z)
- GAN for Vision, KG for Relation: a Two-stage Deep Network for Zero-shot Action Recognition [33.23662792742078]
We propose a two-stage deep neural network for zero-shot action recognition.
In the sampling stage, we utilize a generative adversarial network (GAN) trained on action features and word vectors of seen classes.
In the classification stage, we construct a knowledge graph based on the relationship between word vectors of action classes and related objects.
arXiv Detail & Related papers (2021-05-25T09:34:42Z)
- Unsupervised low-rank representations for speech emotion recognition [78.38221758430244]
We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition.
We report speech emotion recognition (SER) results for learned representations on two databases using different classification methods.
arXiv Detail & Related papers (2021-04-14T18:30:58Z)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network [32.59704287230343]
The proposed approach classifies into three classes: non-speech, single speaker speech, and overlapped speech.
A convolutional recurrent neural network architecture is explored to benefit from both the convolutional layers' capability to model local patterns and the recurrent layers' ability to model sequential information.
The proposed overlapped speech detection model establishes a state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set.
arXiv Detail & Related papers (2021-04-07T03:01:34Z)
- Metric Learning vs Classification for Disentangled Music Representation Learning [36.74680586571013]
We present a single representation learning framework that elucidates the relationship between metric learning, classification, and disentanglement in a holistic manner.
We find that classification-based models are generally advantageous for training time, similarity retrieval, and auto-tagging, while deep metric learning exhibits better performance for triplet-prediction.
arXiv Detail & Related papers (2020-08-09T13:53:12Z)
- Commonality-Parsing Network across Shape and Appearance for Partially Supervised Instance Segmentation [71.59275788106622]
We propose to learn the underlying class-agnostic commonalities that can be generalized from mask-annotated categories to novel categories.
Our model significantly outperforms the state-of-the-art methods in both the partially supervised and few-shot settings for instance segmentation on the COCO dataset.
arXiv Detail & Related papers (2020-07-24T07:23:44Z)