An Attribute-Aligned Strategy for Learning Speech Representation
- URL: http://arxiv.org/abs/2106.02810v1
- Date: Sat, 5 Jun 2021 06:19:14 GMT
- Title: An Attribute-Aligned Strategy for Learning Speech Representation
- Authors: Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee
- Abstract summary: We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
- Score: 57.891727280493015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in speech technology have brought convenience to our lives. However, concern is on the rise because speech signals contain multiple personal attributes, which can lead either to leakage of sensitive information or to biased decisions. In this work, we propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER) and an emotionless representation for speaker verification (SV). Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV, compared to the current state-of-the-art method, which applies adversarial learning on a large emotion corpus, the MSP-Podcast. Our learning strategy also reduces the number of models and the amount of training needed to cover multiple privacy-preserving tasks.
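Implementation details are not spelled out in this abstract, so the following is a minimal PyTorch sketch of the general idea: a VAE latent code partitioned into attribute-sensitive segments, with attribute selection performed by masking the segment tied to the attribute to be removed. All module names, dimensions, and the masking scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttributePartitionedVAE(nn.Module):
    """Toy VAE whose latent is split into attribute-sensitive segments."""

    def __init__(self, feat_dim=80, seg_dim=32, attributes=("identity", "emotion")):
        super().__init__()
        self.attributes = attributes
        self.seg_dim = seg_dim
        latent_dim = seg_dim * len(attributes)
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def encode(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar

    def select(self, z, drop):
        # Zero the latent segment tied to the attribute we want removed.
        i = self.attributes.index(drop)
        z = z.clone()
        z[:, i * self.seg_dim:(i + 1) * self.seg_dim] = 0.0
        return z

model = AttributePartitionedVAE()
x = torch.randn(4, 80)                          # a batch of frame-level features
z, mu, logvar = model.encode(x)
z_ser = model.select(z, drop="identity")        # identity-free, for SER
z_sv = model.select(z, drop="emotion")          # emotionless, for SV
```

Because selection is just masking, one trained model can serve both the SER and SV settings, which matches the claimed reduction in models and training.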
Related papers
- Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) has gained significant attention due to its wide range of application fields, such as mental health, education, and human-computer interaction.
This study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance.
The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets.
arXiv Detail & Related papers (2024-06-01T00:39:55Z)
- Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has commonly been assumed that emotion information is only indirectly embedded within speaker embeddings, leading to their under-utilization for emotion recognition.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
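The summary does not specify the contrastive objective, so the snippet below shows a generic NT-Xent loss as one plausible form of pretraining on emotion-unlabeled embeddings; the pairing strategy (two augmented views of the same utterance) is an assumption, not the authors' recipe.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent loss over two batches of paired embeddings (z1[i] ~ z2[i])."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)              # (2N, D)
    sim = z @ z.t() / temperature               # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))           # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)        # positives are the paired views

loss = nt_xent(torch.randn(8, 192), torch.randn(8, 192))
```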
arXiv Detail & Related papers (2024-01-19T20:31:53Z)
- Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
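A rough sketch of the pipeline as described, using torchaudio's public Wav2Vec 2.0 bundle as the frozen self-supervised extractor; the bundle choice and the tiny CNN head are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE    # assumed bundle choice
extractor = bundle.get_model().eval()          # frozen SSL feature extractor

classifier = nn.Sequential(                    # tiny CNN over feature maps
    nn.Conv1d(768, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 4),                         # e.g., 4 emotion classes
)

wave = torch.randn(1, bundle.sample_rate)      # one second of dummy audio
with torch.no_grad():
    feats, _ = extractor.extract_features(wave)  # list of per-layer features
logits = classifier(feats[-1].transpose(1, 2))   # (batch, num_classes)
```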
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition [7.81011775615268]
We introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER.
Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes.
Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet achieves superior performance compared to state-of-the-art SER approaches.
arXiv Detail & Related papers (2023-08-08T03:43:24Z)
- Disentangling Prosody Representations with Unsupervised Speech Reconstruction [22.873286925385543]
The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
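As a schematic of the two-stage recipe (unsupervised reconstruction pretraining, then supervised fine-tuning), here is a self-contained toy version; the single linear encoder, random stand-in data, and hyperparameters are all placeholders rather than Prosody2Vec itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(80, 64), nn.ReLU())   # stand-in encoder
dec = nn.Linear(64, 80)                             # reconstruction head
head = nn.Linear(64, 4)                             # e.g., 4 emotion classes

# Stage 1: unsupervised reconstruction pretraining on "unlabeled" features.
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
for _ in range(100):
    x = torch.randn(32, 80)                         # stand-in speech features
    loss = F.mse_loss(dec(enc(x)), x)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised fine-tuning of the encoder plus an SER head.
opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), lr=1e-4)
for _ in range(100):
    x, y = torch.randn(32, 80), torch.randint(0, 4, (32,))
    loss = F.cross_entropy(head(enc(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```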
arXiv Detail & Related papers (2022-12-14T01:37:35Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
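One plausible reading of this setup, sketched as a toy: the latent splits into a transcription-relevant part and a residual, the ASR head sees only the former, and reconstruction (the TTS side) uses both. Every module, shape, and target below is invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.randn(8, 64)                       # latent from some speech encoder
z_content, z_rest = z[:, :32], z[:, 32:]     # transcription part vs. residual

asr_head = nn.Linear(32, 30)                 # toy grapheme classifier
decoder = nn.Linear(64, 80)                  # toy acoustic reconstruction

transcript = torch.randint(0, 30, (8,))      # toy per-utterance token target
audio_feats = torch.randn(8, 80)             # toy acoustic target

asr_logits = asr_head(z_content)             # ASR sees the content part only
recon = decoder(torch.cat([z_content, z_rest], dim=1))  # TTS side uses both
loss = F.cross_entropy(asr_logits, transcript) + F.mse_loss(recon, audio_feats)
```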
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
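A minimal late-fusion sketch in the spirit of that description: pooled embeddings from separately fine-tuned speech and text encoders are concatenated and classified. The random tensors stand in for, e.g., a speaker-recognition embedding and a BERT sentence embedding; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

speech_emb = torch.randn(8, 192)             # e.g., speaker-model embedding
text_emb = torch.randn(8, 768)               # e.g., BERT sentence embedding

fusion_head = nn.Sequential(                 # classifier over concatenation
    nn.Linear(192 + 768, 256), nn.ReLU(),
    nn.Linear(256, 4),                       # e.g., 4 emotion classes
)
logits = fusion_head(torch.cat([speech_emb, text_emb], dim=1))
```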
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings [56.93025161787725]
Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing local data.
We propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters.
We show that the attribute inference attack is achievable for SER systems trained using FL.
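Schematically, such an attack can be read as training a classifier from a client's flattened shared update to a sensitive attribute; the sketch below assumes that framing, with toy dimensions and random stand-in updates rather than the paper's attack architecture.

```python
import torch
import torch.nn as nn

update_dim = 10_000                          # flattened update size (toy)
attack_model = nn.Sequential(                # attacker's inference network
    nn.Linear(update_dim, 256), nn.ReLU(),
    nn.Linear(256, 2),                       # binary sensitive attribute
)

shared_update = torch.randn(16, update_dim)  # stand-in for observed updates
attr_logits = attack_model(shared_update)    # predicted attribute per client
```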
arXiv Detail & Related papers (2021-12-26T16:50:42Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
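The summary leaves the RL mechanics unspecified; a toy REINFORCE step of the general kind is sketched below, with a random scalar standing in for the emotion-recognizer reward that i-ETTS reportedly uses.

```python
import torch

logits = torch.randn(4, 10, requires_grad=True)   # toy policy outputs
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                            # sampled synthesis choices
reward = torch.rand(4)                            # stand-in recognizer reward
loss = -(dist.log_prob(action) * reward).mean()   # REINFORCE objective
loss.backward()
```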
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
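As a hedged sketch of a VAW-GAN-style objective (VAE reconstruction plus a Wasserstein critic), the snippet below omits loss weights, the KL term, gradient penalties, and emotion/speaker conditioning; none of it is the authors' actual framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(80, 32)                      # toy VAE encoder
dec = nn.Linear(32, 80)                      # toy decoder/generator
critic = nn.Linear(80, 1)                    # Wasserstein critic

real = torch.randn(8, 80)                    # real acoustic features
fake = dec(enc(real))                        # VAE-style reconstruction

recon_loss = F.l1_loss(fake, real)
critic_loss = critic(fake.detach()).mean() - critic(real).mean()
gen_loss = recon_loss - critic(fake).mean()  # generator tries to fool critic
```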
arXiv Detail & Related papers (2020-10-28T07:16:18Z)