Deep Implicit Distribution Alignment Networks for Cross-Corpus Speech
Emotion Recognition
- URL: http://arxiv.org/abs/2302.08921v1
- Date: Fri, 17 Feb 2023 14:51:37 GMT
- Title: Deep Implicit Distribution Alignment Networks for Cross-Corpus Speech
Emotion Recognition
- Authors: Yan Zhao, Jincen Wang, Yuan Zong, Wenming Zheng, Hailun Lian, Li Zhao
- Abstract summary: We propose a novel deep transfer learning method called deep implicit distribution alignment networks (DIDAN).
DIDAN deals with the cross-corpus speech emotion recognition problem, in which the labeled training (source) and unlabeled testing (target) speech signals come from different corpora.
To evaluate the proposed DIDAN, extensive cross-corpus SER experiments on widely-used speech emotion corpora are carried out.
- Score: 19.281716812246557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel deep transfer learning method called deep
implicit distribution alignment networks (DIDAN) to deal with the cross-corpus
speech emotion recognition (SER) problem, in which the labeled training
(source) and unlabeled testing (target) speech signals come from different
corpora. Specifically, DIDAN first adopts a simple deep regression network
consisting of a set of convolutional and fully connected layers to directly
regress the source speech spectrums onto the emotion labels, so that DIDAN
acquires emotion-discriminative ability. Then, this ability is transferred to
the target speech samples, regardless of corpus variance, by means of a
regularization term called implicit distribution alignment (IDA). Unlike the
widely-used maximum mean discrepancy (MMD) and its variants, IDA borrows the
idea of sample reconstruction to implicitly close the distribution gap, which
enables DIDAN to learn features from speech that are both emotion
discriminative and corpus invariant, directly from the
spectrums. To evaluate the proposed DIDAN, extensive cross-corpus SER
experiments on widely-used speech emotion corpora are carried out. Experimental
results show that the proposed DIDAN can outperform many recent
state-of-the-art methods in coping with the cross-corpus SER tasks.
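For context on the baseline that the abstract contrasts IDA against, the following is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate between two sample sets. It is not the IDA term and not the authors' code: the abstract does not give IDA's formula, so none is attempted here. The Gaussian kernel, the `gamma` value, and the synthetic `sample_features` stand-in for corpus features are all assumptions chosen for illustration.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


def sample_features(rng, n, dim, shift):
    """Hypothetical stand-in for corpus features: Gaussian noise plus a shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
source = sample_features(rng, 40, 4, shift=0.0)        # "source corpus"
same_corpus = sample_features(rng, 40, 4, shift=0.0)   # same distribution
other_corpus = sample_features(rng, 40, 4, shift=2.0)  # shifted "target corpus"

gap_same = mmd_squared(source, same_corpus, gamma=0.1)
gap_other = mmd_squared(source, other_corpus, gamma=0.1)
print(f"same distribution:    {gap_same:.4f}")
print(f"shifted distribution: {gap_other:.4f}")
```

In MMD-based transfer methods, a term of this kind is typically added to the training loss so that source and target features are pushed toward the same distribution. According to the abstract, IDA pursues the same goal through sample reconstruction instead.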
Related papers
- Tackling Ambiguity from Perspective of Uncertainty Inference and Affinity Diversification for Weakly Supervised Semantic Segmentation [12.308473939796945]
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve dense tasks without laborious annotations.
The performance of WSSS, especially the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, widely suffers from ambiguity.
We propose UniA, a unified single-staged WSSS framework, to tackle this issue from the perspective of uncertainty inference and affinity diversification.
arXiv Detail & Related papers (2024-04-12T01:54:59Z)
- Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection [24.145060992747077]
We propose a Likelihood-Aware Semantic Alignment (LSA) framework to promote the image-text correspondence into semantically high-likelihood regions.
Extensive experiments demonstrate the remarkable OOD detection performance of our proposed LSA, surpassing existing methods by a margin of 15.26% and 18.88% on two F-OOD benchmarks.
arXiv Detail & Related papers (2023-12-04T08:53:59Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition [16.159171586384023]
Cross-corpus speech emotion recognition (SER) seeks to generalize the ability to infer speech emotion from a well-labeled corpus to an unlabeled one.
Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment.
We propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER.
arXiv Detail & Related papers (2023-08-04T08:15:17Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition [59.52434325897716]
We propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives.
For the former, an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space.
For the latter, the pairwise relationships of semantic features between instances are fully exploited to estimate the extent of ambiguity in the instance space.
arXiv Detail & Related papers (2021-04-01T03:21:57Z)
- Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network [48.1535353007371]
Cross-domain Speech Emotion Recognition (SER) is still a challenging task due to the distribution shift between source and target domains.
We propose a Domain Adversarial Neural Network (DANN) based approach to mitigate this distribution shift problem for cross-lingual SER.
arXiv Detail & Related papers (2020-12-21T08:21:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.