The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Abstract Overview
This paper critically examines whether cosine similarity of speech emotion embeddings (particularly from emotion2vec) is a valid objective metric for evaluating emotional similarity in generated speech. Through controlled adversarial triplet tasks, dimensional sensitivity tests, and human-alignment evaluations across six speech corpora and multiple encoders, the authors demonstrate that these embedding spaces are strongly confounded by speaker identity and linguistic content. Mean-centering is applied to address latent-space anisotropy, but it does not remove these confounds. The study concludes that widely used EMO-SIM-style metrics reward acoustic resemblance rather than genuine emotional transfer, making them unreliable for zero-shot speech generation evaluation.
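To make the metric under examination concrete, here is a minimal sketch of cosine similarity with and without mean-centering. The toy vectors and the helper names `cosine` and `mean_center` are illustrative assumptions, not emotion2vec outputs or the authors' code; the centering shown (subtracting the per-dimension mean of a reference set) is one common way to counter anisotropy and may differ from the paper's exact procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for degenerate (all-zero) vectors
    return dot / (norm_u * norm_v)

def mean_center(embeddings):
    """Subtract the per-dimension mean of the set from every embedding."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    return [[e[d] - mean[d] for d in range(dim)] for e in embeddings]

# Hypothetical toy embeddings that share a large common offset,
# mimicking an anisotropic space where all vectors point the same way.
raw = [[10.0, 10.0, 1.0], [10.0, 10.0, -1.0], [10.0, 11.0, 0.0]]
centered = mean_center(raw)

raw_sim = cosine(raw[0], raw[1])                 # dominated by the shared offset
centered_sim = cosine(centered[0], centered[1])  # offset removed
print(f"raw: {raw_sim:.3f}, centered: {centered_sim:.3f}")
```

In this toy case the raw similarity is close to 1 only because of the shared offset, and centering exposes the difference on the remaining dimension. Centering changes what the similarity is computed over, but it cannot by itself decide whether the remaining variance encodes emotion, speaker, or content.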
Novelty
The paper's main novelty is a direct, systematic evaluation of the emotion-similarity metric itself rather than of any speech synthesis system. It combines adversarial triplet designs across four controlled scenarios, continuous valence/arousal sensitivity tests, human preference alignment evaluation, and layer-wise probing to expose why popular emotion-embedding cosine similarities fail despite strong emotion-classification performance.
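The adversarial triplet idea can be sketched as follows. This is an assumed formulation, not the authors' protocol: each triplet holds an anchor, a positive that shares the anchor's emotion but differs in content or speaker, and a negative that shares the content or speaker but differs in emotion. The metric scores a triplet as correct when the anchor is closer to the positive, so chance is 50%. The 2-D vectors and the function name `triplet_accuracy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets in which the
    anchor is more similar to the positive than to the negative."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy 2-D embeddings (assumption): dimension 0 stands in for content/speaker,
# dimension 1 stands in for emotion.
triplets = [
    # content dominates the embedding: the distractor wins
    ([1.0, 0.2], [-1.0, 0.2], [1.0, -0.2]),
    ([1.0, 0.3], [-1.0, 0.3], [1.0, -0.3]),
    # emotion dominates the embedding: the positive wins
    ([0.1, 1.0], [-0.1, 1.0], [0.1, -1.0]),
]

acc = triplet_accuracy(triplets)
print(f"triplet accuracy: {acc:.2%}")
```

When the non-emotional dimension carries most of the vector's norm, accuracy falls below chance rather than to chance, because the distractor is systematically preferred.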
Results
In categorical adversarial settings, emotion2vec accuracy drops as low as 3.38% on CREMA-D under a linguistic distractor, far below the 50% chance level; a below-chance score means the metric systematically prefers the distractor over the emotionally matched sample. In dimensional evaluation, shift-discriminability remains near random chance and trend monotonicity (Spearman's ρ) stays near zero across all datasets and encoders. In human-alignment tests, even the best emotion2vec+ variants reach only 52.25%–65.00% accuracy, and layer-wise analysis shows perceptual alignment degrading from 58.0% at layer 0 (L0) to 45.0% at layer 7 (L7).
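For readers unfamiliar with the monotonicity measure, the sketch below computes Spearman's ρ between a labeled emotional gap and an embedding similarity score. The setup is an assumption for illustration: `arousal_gap` and both similarity lists are made-up numbers, and the summary above does not specify how the paper pairs samples. A metric that tracks emotion should give ρ near -1 here (similarity falls as the gap grows), while ρ near 0 indicates no monotonic trend.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: labeled arousal gap between a reference and each candidate.
arousal_gap = [0.0, 0.5, 1.0, 1.5, 2.0]
sensitive_sim = [0.95, 0.80, 0.60, 0.45, 0.20]    # similarity tracks the gap
insensitive_sim = [0.91, 0.93, 0.90, 0.93, 0.91]  # similarity ignores the gap

rho_sensitive = spearman_rho(arousal_gap, sensitive_sim)
rho_insensitive = spearman_rho(arousal_gap, insensitive_sim)
print(f"sensitive: {rho_sensitive:.2f}, insensitive: {rho_insensitive:.2f}")
```

The second series corresponds to the reported failure mode: similarity values stay high and roughly flat regardless of how far the emotion has shifted.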
Key Points
- Emotion embedding cosine similarity is heavily confounded by speaker identity and linguistic content, causing it to actively penalize correct emotional matches when acoustic features differ in zero-shot settings.
- Mean-centering addresses latent-space anisotropy but does not resolve weak categorical robustness or poor dimensional sensitivity, with shift discriminability and trend monotonicity remaining near chance across datasets.
- Alignment with human judgments is limited (52.25%–65.00% accuracy for the best fine-tuned emotion2vec+ variants), and deeper emotion2vec transformer layers degrade perceptual alignment rather than improving it.