SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker
Embedding and Vision Transformers
- URL: http://arxiv.org/abs/2211.02366v1
- Date: Fri, 4 Nov 2022 10:49:44 GMT
- Title: SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker
Embedding and Vision Transformers
- Authors: A. Arezzo, S. Berretti
- Abstract summary: This paper develops a new learning solution for Speech Emotion Recognition.
It is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding.
Experiments have been performed on several benchmarks in a cross-corpus setting.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Speech Emotion Recognition (SER) has been investigated
mainly by transforming the speech signal into spectrograms that are then
classified using Convolutional Neural Networks pretrained on generic images and
fine-tuned with spectrograms. In this paper, we start from the general idea
above and develop a new learning solution for SER, which is based on Compact
Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs,
the learning power of Vision Transformers (ViT) is combined with a diminished
need for large volumes of data, made possible by the convolutional layers. This is
important in SER, where large corpora of data are usually not available. The
speaker embedding allows the network to extract an identity representation of
the speaker, which is then integrated by means of a self-attention mechanism
with the features that the CCT extracts from the spectrogram. Overall, the
solution is capable of operating in real time, showing promising results in a
cross-corpus scenario, where training and test datasets are kept separate.
Experiments have been performed on several benchmarks in a cross-corpus setting,
which is rarely used in the literature, with results that are comparable or superior
to those obtained with state-of-the-art network architectures. Our code is
available at https://github.com/JabuMlDev/Speaker-VGG-CCT.
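As a rough illustration of the pipeline described in the abstract, the sketch below shows a CCT-style model in PyTorch: a convolutional tokenizer turns the spectrogram into a token sequence, a precomputed speaker embedding is projected and prepended as an extra token, and a small transformer encoder lets self-attention fuse the two before classification. All module names, dimensions, and the exact fusion mechanism are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Minimal sketch of a speaker-conditioned Compact Convolutional Transformer (CCT)
# for SER. Sizes and fusion details are assumptions for illustration only.
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """Convolutional tokenizer: maps a spectrogram to a sequence of tokens."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, spec):                      # spec: (B, 1, F, T)
        x = self.conv(spec)                       # (B, D, F', T')
        return x.flatten(2).transpose(1, 2)       # (B, F'*T', D) token sequence


class SpeakerCCT(nn.Module):
    """CCT-style encoder with a speaker-embedding token joined to the spectrogram tokens."""

    def __init__(self, embed_dim=128, num_heads=4, depth=4,
                 speaker_dim=192, num_emotions=4):
        super().__init__()
        self.tokenizer = ConvTokenizer(embed_dim)
        # Project an externally computed speaker embedding (e.g. an x-vector)
        # into the same space as the spectrogram tokens.
        self.speaker_proj = nn.Linear(speaker_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(embed_dim, num_emotions)

    def forward(self, spec, speaker_emb):
        tokens = self.tokenizer(spec)                          # (B, N, D)
        spk = self.speaker_proj(speaker_emb).unsqueeze(1)      # (B, 1, D)
        fused = self.encoder(torch.cat([spk, tokens], dim=1))  # joint self-attention
        return self.classifier(fused.mean(dim=1))              # pooled logits


if __name__ == "__main__":
    model = SpeakerCCT()
    spec = torch.randn(2, 1, 128, 300)       # batch of mel-spectrograms
    speaker_emb = torch.randn(2, 192)         # precomputed speaker embeddings
    print(model(spec, speaker_emb).shape)     # torch.Size([2, 4])
```

In this toy version the fusion is simply joint self-attention over the concatenated speaker token and spectrogram tokens; the paper's actual integration of the identity representation may differ.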
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
We show that, with contrastive pre-training, Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
arXiv Detail & Related papers (2023-01-23T17:51:39Z) - BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - Synthesized Speech Detection Using Convolutional Transformer-Based
Spectrogram Analysis [16.93803259128475]
Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
arXiv Detail & Related papers (2022-05-03T22:05:35Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z) - A Framework for Generative and Contrastive Learning of Audio
Representations [2.8935588665357077]
We present a framework for contrastive learning of audio representations in a self-supervised setting, without access to ground-truth labels.
We also explore generative models based on state-of-the-art transformer-based architectures for learning latent spaces for audio signals.
Our system achieves considerable performance compared to a fully supervised method that has access to ground-truth labels for training the neural network model.
arXiv Detail & Related papers (2020-10-22T05:52:32Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)