SERAB: A multi-lingual benchmark for speech emotion recognition
- URL: http://arxiv.org/abs/2110.03414v1
- Date: Thu, 7 Oct 2021 13:01:34 GMT
- Title: SERAB: A multi-lingual benchmark for speech emotion recognition
- Authors: Neil Scheidwasser-Clow, Mikolaj Kegler, Pierre Beckmann, Milos Cernak
- Abstract summary: Recent developments in speech emotion recognition (SER) often leverage deep neural networks (DNNs).
We present the Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for evaluating the performance and generalization capacity of different approaches for utterance-level SER.
- Score: 12.579936838293387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in speech emotion recognition (SER) often leverage deep
neural networks (DNNs). Comparing and benchmarking different DNN models can
often be tedious due to the use of different datasets and evaluation protocols.
To facilitate the process, we present the Speech Emotion Recognition
Adaptation Benchmark (SERAB), a framework for evaluating the performance and
generalization capacity of different approaches for utterance-level SER. The
benchmark is composed of nine datasets for SER in six languages. Since the
datasets have different sizes and numbers of emotional classes, the proposed
setup is particularly suitable for estimating the generalization capacity of
pre-trained DNN-based feature extractors. We used the proposed framework to
evaluate a selection of standard hand-crafted feature sets and state-of-the-art
DNN representations. The results highlight that using only a subset of the data
included in SERAB can result in biased evaluation, while compliance with the
proposed protocol can circumvent this issue.
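In practice, the protocol reduces to a simple recipe: freeze a feature extractor, pool its output into one embedding per utterance, fit a lightweight classifier on each of the nine datasets, and report all nine scores. The sketch below is a minimal illustration of that loop, with a placeholder `extract_embedding` function and hypothetical dataset loaders standing in for the actual SERAB pipeline:

```python
# Sketch of a SERAB-style evaluation loop (hypothetical helper names, not the
# authors' implementation): a frozen feature extractor maps each utterance to
# a fixed-size embedding, and a lightweight classifier is fit per dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def extract_embedding(waveform: np.ndarray, sr: int) -> np.ndarray:
    # Placeholder feature: log-magnitude spectrum statistics. Swap in a frozen
    # pre-trained DNN (or a hand-crafted feature set) for a real evaluation.
    logspec = np.log1p(np.abs(np.fft.rfft(waveform)))
    return np.array([logspec.mean(), logspec.std(), logspec.max()])

def evaluate_dataset(train, test):
    # train/test: lists of (waveform, sample_rate, emotion_label) tuples.
    X_tr = np.stack([extract_embedding(w, sr) for w, sr, _ in train])
    X_te = np.stack([extract_embedding(w, sr) for w, sr, _ in test])
    y_tr = [label for _, _, label in train]
    y_te = [label for _, _, label in test]
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Reporting a score for every dataset (rather than a convenient subset) is
# what guards against the single-dataset bias the abstract warns about:
# scores = {name: evaluate_dataset(*load(name)) for name in SERAB_DATASETS}
```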
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition [3.4355593397388597]
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models.
We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models.
We find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER.
arXiv Detail & Related papers (2024-08-14T23:33:10Z)
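The cross-lingual finding above is straightforward to probe: mean-pool the Whisper encoder's hidden states into an utterance-level embedding and feed it to any downstream emotion classifier. A hedged sketch using the Hugging Face `transformers` Whisper classes follows; the benchmark's exact pipeline and checkpoint choice may differ:

```python
# Sketch: mean-pooled Whisper encoder states as utterance-level SER features.
# Illustrates the idea above; not the benchmark's exact pipeline.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def whisper_embedding(waveform, sr=16000):
    # waveform: 1-D float array sampled at 16 kHz.
    inputs = fe(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        states = model.encoder(inputs.input_features).last_hidden_state
    return states.mean(dim=1).squeeze(0)  # pool over time -> fixed vector
```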
- A Comparative Study of Pre-trained Encoders for Low-Resource Named Entity Recognition [10.0731894715001]
We introduce an encoder evaluation framework, and use it to compare the performance of state-of-the-art pre-trained representations on the task of low-resource NER.
We analyze a wide range of encoders pre-trained with different strategies, model architectures, intermediate-task fine-tuning, and contrastive learning.
arXiv Detail & Related papers (2022-04-11T09:48:26Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
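For intuition, a discounted recall of this kind can be written so that each IoU "hit" is scaled down when the predicted boundaries drift from the ground truth. The discount form below (normalized start/end alignment factors) is an illustrative assumption; the paper defines the exact metric:

```python
# Illustrative discounted recall in the spirit of dR@n,IoU@m (top-1 case).
# The alignment-based discount here is an assumption for illustration only.
def discounted_recall(preds, gts, m=0.5):
    # preds/gts: lists of (start, end) moments, one prediction per query.
    hits = []
    for (ps, pe), (gs, ge) in zip(preds, gts):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = max(pe, ge) - min(ps, gs)
        iou = inter / union if union > 0 else 0.0
        dur = max(pe, ge) or 1.0            # normalizer for boundary offsets
        alpha_s = 1.0 - abs(ps - gs) / dur  # start-alignment discount
        alpha_e = 1.0 - abs(pe - ge) / dur  # end-alignment discount
        hits.append(float(iou >= m) * alpha_s * alpha_e)
    return sum(hits) / len(hits)

# A prediction overlapping the ground truth but with shifted boundaries
# scores below 1.0 even when its IoU clears the threshold:
print(discounted_recall([(0.0, 10.0)], [(1.0, 9.0)]))  # 0.81
```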
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores from a test speech signal.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
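A multi-objective assessor of this kind can be approximated by a shared encoder with one regression head per target score. The PyTorch sketch below is a minimal stand-in assuming log-mel inputs; MOSA-Net's actual model uses cross-domain features and a more elaborate backbone:

```python
# Minimal multi-objective assessment sketch: shared encoder, one regression
# head per target (PESQ / STOI / SDI). Illustrative only, not MOSA-Net itself.
import torch
import torch.nn as nn

class MultiObjectiveAssessor(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, 1) for name in ("pesq", "stoi", "sdi")}
        )

    def forward(self, mel):            # mel: (batch, time, n_mels)
        _, h = self.encoder(mel)       # h: (1, batch, hidden) final state
        h = h.squeeze(0)
        return {n: head(h).squeeze(-1) for n, head in self.heads.items()}

# Example: three per-utterance scores from a batch of two spectrograms.
scores = MultiObjectiveAssessor()(torch.randn(2, 100, 80))
```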
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z)
- Contextual Biasing of Language Models for Speech Recognition in Goal-Oriented Conversational Agents [11.193867567895353]
Goal-oriented conversational interfaces are designed to accomplish specific tasks.
We propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided at inference time.
Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
arXiv Detail & Related papers (2021-03-18T15:38:08Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.