CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using
Cochlear Cepstrum-based Masking for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2402.06923v1
- Date: Sat, 10 Feb 2024 11:13:13 GMT
- Title: CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using
Cochlear Cepstrum-based Masking for Speech Emotion Recognition
- Authors: Ioannis Ziogas, Hessa Alfalahi, Ahsan H. Khandoker, Leontios J.
Hadjileontiadis
- Abstract summary: CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results support CochCeps-Augment as a standard tool for speech emotion recognition analysis.
- Score: 5.974778743092437
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised learning (SSL) for automated recognition of the
emotional content of speech can be heavily degraded by the presence of noise,
which hinders the modeling of the intricate temporal and spectral informative
structures of speech. Recently, SSL on large speech datasets, as well as new
audio-specific SSL proxy tasks such as temporal and frequency masking, have
emerged, yielding superior performance compared to classic approaches drawn
from the image augmentation domain. Our proposed contribution builds upon this
successful paradigm by introducing CochCeps-Augment, a novel bio-inspired
masking augmentation task for self-supervised contrastive learning of speech
representations. Specifically, we utilize the newly introduced bio-inspired
cochlear cepstrogram (CCGRAM) to derive noise robust representations of input
speech, that are then further refined through a self-supervised learning
scheme. The latter employs SimCLR to generate contrastive views of a CCGRAM
through masking of its angle and quefrency dimensions. We validate our approach
on the K-EmoCon emotion recognition benchmark, for the first time in a
speaker-independent setting, through unsupervised pre-training followed by
linear probing and fine-tuning. Our results support CochCeps-Augment as a
standard tool for speech emotion recognition analysis, demonstrating the added
value of bio-inspired masking as an informative augmentation task for
self-supervision. Our code for implementing
CochCeps-Augment will be made available at:
https://github.com/GiannisZgs/CochCepsAugment.
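The masking scheme described in the abstract can be sketched as follows, assuming the CCGRAM is available as a 2-D array with quefrency and angle (frame) axes. The band widths, array shape, and helper names below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_band(ccgram, axis, max_width, rng):
    """Zero out one random contiguous band along the given axis
    (axis 0 = quefrency, axis 1 = angle/frame). Returns a masked copy."""
    out = ccgram.copy()
    size = out.shape[axis]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, size - width + 1))
    sl = [slice(None)] * out.ndim
    sl[axis] = slice(start, start + width)
    out[tuple(sl)] = 0.0
    return out

def two_views(ccgram, rng, max_quefrency=8, max_angle=16):
    """Generate two independently masked views of the same CCGRAM,
    suitable as a positive pair for SimCLR-style contrastive learning."""
    view1 = mask_band(mask_band(ccgram, 0, max_quefrency, rng), 1, max_angle, rng)
    view2 = mask_band(mask_band(ccgram, 0, max_quefrency, rng), 1, max_angle, rng)
    return view1, view2

# Hypothetical CCGRAM: 64 quefrency bins x 128 angle/time frames.
ccgram = rng.standard_normal((64, 128))
v1, v2 = two_views(ccgram, rng)
```

Because the two views come from independent random draws, they differ from each other while both remaining derived from the same underlying utterance, which is what the contrastive objective requires.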
Related papers
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z) - Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - A Survey on Masked Autoencoder for Self-supervised Learning in Vision
and Beyond [64.85076239939336]
Self-supervised learning (SSL) in vision may follow a trajectory similar to that in NLP,
where generative pretext tasks based on masked prediction (e.g., BERT) have become a de facto standard SSL practice.
The success of masked image modeling has revived the masked autoencoder.
arXiv Detail & Related papers (2022-07-30T09:59:28Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Why does Self-Supervised Learning for Speech Recognition Benefit Speaker
Recognition? [86.53044183309824]
We study which factor leads to the success of self-supervised learning on speaker-related tasks.
Our empirical results on the VoxCeleb-1 dataset suggest that the benefit of SSL to the speaker verification (SV) task comes from a combination of the masked speech prediction loss, data scale, and model size.
arXiv Detail & Related papers (2022-04-27T08:35:57Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Speech SIMCLR: Combining Contrastive and Reconstruction Objective for
Self-supervised Speech Representation Learning [20.39971017940006]
Speech SimCLR is a new self-supervised objective for speech representation learning.
During training, Speech SimCLR applies augmentations to both the raw speech waveform and its spectrogram.
arXiv Detail & Related papers (2020-10-27T02:09:06Z)
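SimCLR-style training, used by both CochCeps-Augment and Speech SimCLR above, optimizes the NT-Xent contrastive loss over pairs of augmented views. A minimal NumPy sketch is given below; the batch size, embedding dimension, and temperature are illustrative assumptions, and real systems compute this on encoder outputs rather than random vectors:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.
    z1[i] and z2[i] are embeddings of two views of the same sample."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize rows
    sim = z @ z.T / tau                                # scaled cosine similarities
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # Row i's positive is its paired view at index i+n (or i-n).
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(1)
z1 = rng.standard_normal((4, 16))
loss_random = nt_xent(z1, rng.standard_normal((4, 16)))       # unrelated pairs
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.standard_normal((4, 16)))  # near-identical pairs
```

As expected of a contrastive objective, the loss is lower when paired views are near-identical than when they are unrelated, since the positive pair then dominates the softmax over negatives.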
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.