Speaker Normalization for Self-supervised Speech Emotion Recognition
- URL: http://arxiv.org/abs/2202.01252v1
- Date: Wed, 2 Feb 2022 19:30:47 GMT
- Title: Speaker Normalization for Self-supervised Speech Emotion Recognition
- Authors: Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory
- Abstract summary: We propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation.
We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
- Score: 16.044405846513495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large speech emotion recognition datasets are hard to obtain, and small
datasets may contain biases. Deep-net-based classifiers, in turn, are prone to
exploit those biases and find shortcuts such as speaker characteristics. These
shortcuts usually harm a model's ability to generalize. To address this
challenge, we propose a gradient-based adversarial learning framework that learns
a speech emotion recognition task while normalizing speaker characteristics
from the feature representation. We demonstrate the efficacy of our method on
both speaker-independent and speaker-dependent settings and obtain new
state-of-the-art results on the challenging IEMOCAP dataset.
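Gradient-based adversarial normalization of this kind is commonly realized with a gradient reversal trick: a speaker classifier is trained on the shared features as usual, but its gradient is sign-flipped before reaching the encoder, so the encoder learns emotion-relevant features that carry little speaker information. Below is a minimal, framework-free numeric sketch of one such training step on a toy scalar model; all names, dimensions, and values are illustrative assumptions, not the paper's implementation.

```python
def train_step(w_enc, w_emo, w_spk, x, y_emo, y_spk, lr=0.05, lambd=1.0):
    """One gradient-reversal step on a toy 1-D model with squared-error losses.

    Feature: h = w_enc * x; emotion prediction: w_emo * h;
    speaker prediction: w_spk * h. `lambd` scales the reversed gradient.
    """
    h = w_enc * x
    e_emo = w_emo * h - y_emo  # emotion prediction error
    e_spk = w_spk * h - y_spk  # speaker prediction error

    # Both heads descend their own losses normally.
    g_emo_head = 2 * e_emo * h
    g_spk_head = 2 * e_spk * h

    # Encoder gradient: the emotion term flows through unchanged, but the
    # speaker term is sign-flipped (gradient reversal), pushing the encoder
    # toward features the speaker classifier cannot exploit.
    g_enc = 2 * e_emo * w_emo * x - lambd * 2 * e_spk * w_spk * x

    return (w_enc - lr * g_enc,
            w_emo - lr * g_emo_head,
            w_spk - lr * g_spk_head)
```

In a real system the scalar weights would be deep networks and the squared errors cross-entropy losses, but the sign flip on the speaker branch is the essential mechanism: encoder and speaker classifier play a minimax game within a single backward pass.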
Related papers
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
- Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction [17.05599594354308]
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information.
In the task of target speech extraction, certain elements of global and local semantic information in the reference speech can lead to speaker confusion.
We propose a self-supervised disentangled representation learning method to overcome this challenge.
arXiv Detail & Related papers (2023-12-16T03:48:24Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Improving Self-Supervised Speech Representations by Disentangling Speakers [56.486084431528695]
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus.
Disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well.
We propose a new SSL method that can achieve speaker disentanglement without severe loss of content.
arXiv Detail & Related papers (2022-04-20T04:56:14Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Embedded Emotions -- A Data Driven Approach to Learn Transferable Feature Representations from Raw Speech Input for Emotion Recognition [1.4556324908347602]
We investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition.
Our results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
arXiv Detail & Related papers (2020-09-30T09:18:31Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.