FuguReport

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

Authors Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Wen Hsu, Yun-Man Hsu, Chun Wei Chen, Shrikanth Narayanan, Hung-yi Lee
Affiliations National Taiwan University / Gilbert AI Lab / University of Southern California
Categories Evaluation / Speech Generation Evaluation / Emotion embedding similarity assessment, Task / Emotion Recognition / Emotion cue capturing across speakers and languages, Method / Embedding Methods / Use of emotion2vec encoders
License CC BY-SA 4.0

Abstract Overview

This paper critically examines whether cosine similarity of speech emotion embeddings (particularly from emotion2vec) is a valid objective metric for evaluating emotional similarity in generated speech. Through controlled adversarial triplet tasks, dimensional sensitivity tests, and human-alignment evaluations across six speech corpora and multiple encoders, the authors demonstrate that these embedding spaces are strongly confounded by speaker identity and linguistic content. Mean-centering corrects latent-space anisotropy but leaves categorical robustness weak and dimensional sensitivity poor. The study concludes that widely used EMO-SIM-style metrics reward acoustic resemblance rather than genuine emotional transfer, making them unreliable for zero-shot speech generation evaluation.
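
As a rough illustration of the kind of metric under scrutiny, the sketch below computes cosine similarity between two embeddings before and after mean-centering. The random vectors with a shared offset are a toy stand-in for encoder outputs (an assumption, not emotion2vec data), and the helper names `cosine_similarity` and `mean_center` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_center(embeddings):
    """Subtract the corpus mean vector from every embedding (rows = utterances)."""
    embeddings = np.asarray(embeddings, dtype=float)
    return embeddings - embeddings.mean(axis=0, keepdims=True)

# Toy stand-in for encoder outputs (assumption): random directions plus a
# large offset shared by every utterance, which mimics an anisotropic space.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 16)) + 5.0

centered = mean_center(raw)

# Similarity of the same pair of utterances, before and after centering.
raw_sim = cosine_similarity(raw[0], raw[1])
centered_sim = cosine_similarity(centered[0], centered[1])
print(f"raw: {raw_sim:.3f}  centered: {centered_sim:.3f}")
```

In this toy setup the shared offset alone pushes the raw similarity close to 1 for unrelated vectors, which is the anisotropy effect that mean-centering removes. Centering does nothing about speaker or content information encoded in the remaining directions.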

Novelty

The paper's main novelty is a direct, systematic evaluation of the emotion-similarity metric itself rather than of any speech synthesis system. It combines adversarial triplet designs across four controlled scenarios, continuous valence/arousal sensitivity tests, human preference alignment evaluation, and layer-wise probing to expose why popular emotion-embedding cosine similarities fail despite strong emotion-classification performance.
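
One plausible way to score an adversarial triplet task is sketched below. It assumes each triplet holds an anchor, a positive with the same emotion but a different speaker, and a negative with a different emotion but the same speaker; this layout, the four-dimensional toy vectors, and the function name `triplet_accuracy` are illustrative assumptions rather than the authors' protocol.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets in which the
    anchor is more similar to the positive than to the negative."""
    correct = 0
    for anchor, positive, negative in triplets:
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative):
            correct += 1
    return correct / len(triplets)

# Toy embeddings (assumption): dims 0-1 carry emotion, dims 2-3 carry speaker.
# Confounded case: the speaker component is three times larger than emotion.
confounded = (
    [1.0, 0.0, 3.0, 0.0],  # anchor:   emotion 1, speaker A
    [1.0, 0.0, 0.0, 3.0],  # positive: emotion 1, speaker B
    [0.0, 1.0, 3.0, 0.0],  # negative: emotion 2, speaker A (distractor)
)
# Clean case: the speaker component is small relative to emotion.
clean = (
    [1.0, 0.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.1, 0.0],
)

confounded_acc = triplet_accuracy([confounded])
clean_acc = triplet_accuracy([clean])
print(confounded_acc, clean_acc)
```

With two candidates per triplet, random guessing would score 50%. When the speaker component dominates, the metric prefers the distractor, so accuracy can fall below chance instead of merely approaching it.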

Results

In categorical adversarial settings, emotion2vec accuracy drops as low as 3.38% on CREMA-D under a linguistic distractor, far below the 50% chance level, which means the metric systematically prefers the distractor over the correct emotional match. In dimensional evaluation, shift-discriminability remains near random chance and trend monotonicity (Spearman's ρ) stays near zero across all datasets and encoders. In human-alignment tests, even the best emotion2vec+ variants reach only 52.25%–65.00% accuracy, and layer-wise analysis shows perceptual alignment degrading from 58.0% at layer 0 (L0) to 45.0% at layer 7 (L7).
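
To make the trend-monotonicity figure concrete, the sketch below computes Spearman's ρ between a set of emotional gaps (for example, differences in arousal rating) and the embedding similarities of the same pairs. The numbers are invented for illustration, the sign convention (similarity should fall as the gap grows) is an assumption, and `rank` and `spearman_rho` are hypothetical helpers that ignore ties for brevity.

```python
import numpy as np

def rank(values):
    """Ranks 0..n-1 by ascending value (assumes no ties, for brevity)."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values), dtype=float)
    ranks[np.argsort(values)] = np.arange(len(values))
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Invented toy data (assumption): gap in arousal between the two utterances
# of each pair, and the cosine similarity a metric assigns to that pair.
arousal_gap = [0.1, 0.2, 0.3, 0.4, 0.5]
sensitive_sims = [0.90, 0.80, 0.60, 0.50, 0.20]       # falls as the gap grows
insensitive_sims = [0.710, 0.690, 0.720, 0.700, 0.705]  # no clear trend

rho_sensitive = spearman_rho(arousal_gap, sensitive_sims)
rho_insensitive = spearman_rho(arousal_gap, insensitive_sims)
print(f"sensitive: {rho_sensitive:.2f}  insensitive: {rho_insensitive:.2f}")
```

Under this convention a metric that tracks emotional distance would give ρ close to -1, whereas a ρ near zero, as reported above, indicates that similarity scores carry little ordering information about the size of the emotional shift.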

Key Points

  1. Emotion embedding cosine similarity is heavily confounded by speaker identity and linguistic content, causing it to actively penalize correct emotional matches when acoustic features differ in zero-shot settings.
  2. Mean-centering addresses latent-space anisotropy but does not resolve weak categorical robustness or poor dimensional sensitivity, with shift discriminability and trend monotonicity remaining near chance across datasets.
  3. Alignment with human judgments is limited (52.25%–65.00% accuracy for fine-tuned variants), and deeper emotion2vec transformer layers further degrade perceptual alignment rather than improving it.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.