Pairwise Evaluation of Accent Similarity in Speech Synthesis
- URL: http://arxiv.org/abs/2505.14410v1
- Date: Tue, 20 May 2025 14:23:50 GMT
- Title: Pairwise Evaluation of Accent Similarity in Speech Synthesis
- Authors: Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond
- Abstract summary: We aim to enhance both subjective and objective evaluation methods for accent similarity. We refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. We utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation.
- Score: 11.513055793492418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used to evaluate accent generation. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.
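Since the abstract names the objective metrics only at a high level, here is a minimal, self-contained Python sketch of the three ingredients: a vowel-formant distance, a DTW-aligned distance between phonetic posteriorgrams (PPGs), and Mel Cepstral Distortion. Formant tracking, PPG extraction, and mel-cepstral analysis are assumed to happen upstream (e.g. with Praat and an ASR acoustic model), and all function names are illustrative rather than taken from the paper.

```python
import numpy as np

def vowel_formant_distance(f_ref: np.ndarray, f_syn: np.ndarray) -> float:
    """Mean Euclidean distance between matched (F1, F2) vowel formant pairs.

    f_ref, f_syn: (num_vowels, 2) arrays in Hz, aligned vowel by vowel.
    """
    return float(np.mean(np.linalg.norm(f_ref - f_syn, axis=1)))

def ppg_distance(p_ref: np.ndarray, p_syn: np.ndarray) -> float:
    """Length-normalised DTW cost between two phonetic posteriorgrams.

    p_ref: (T1, P), p_syn: (T2, P); rows are posteriors over P phone classes.
    """
    t1, t2 = len(p_ref), len(p_syn)
    # Frame-pairwise Euclidean costs (symmetric KL would be an alternative).
    cost = np.linalg.norm(p_ref[:, None, :] - p_syn[None, :, :], axis=-1)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):          # standard DTW recursion
        for j in range(1, t2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[t1, t2] / (t1 + t2))

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """MCD in dB over time-aligned mel-cepstral frames, excluding c0."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    return float((10.0 / np.log(10.0)) *
                 np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```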
Related papers
- On the Relationship between Accent Strength and Articulatory Features [26.865464238029748]
This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Results indicate that tongue positioning patterns distinguish the two dialects, with notable inter-dialect differences in rhotic and low back vowels.
arXiv Detail & Related papers (2025-07-03T20:08:28Z)
- Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
- Representation of perceived prosodic similarity of conversational feedback [3.7277730514654555]
Spectral and self-supervised speech representations encode prosody better than extracted pitch features. It is possible to further condense and align the representations to human perception through contrastive learning.
arXiv Detail & Related papers (2025-05-19T15:47:51Z)
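The contrastive-learning step in the entry above is described only in one line, so the sketch below shows a generic form such an alignment objective could take: an InfoNCE-style loss where feedback pairs judged similar by listeners act as positives and the rest of the batch as negatives. This is an assumption-laden toy, not the paper's actual objective.

```python
import numpy as np

def contrastive_alignment_loss(z1: np.ndarray, z2: np.ndarray,
                               tau: float = 0.1) -> float:
    """z1, z2: (N, d) L2-normalised embeddings; row i of each is a positive pair."""
    sim = (z1 @ z2.T) / tau                      # (N, N) similarity logits
    sim -= sim.max(axis=1, keepdims=True)        # numerical stabilisation
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # pull positive pairs together
```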
- Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS [52.89324095217975]
Previous approaches to accent conversion mainly aimed at making non-native speech sound more native. We develop a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers.
arXiv Detail & Related papers (2024-10-19T06:12:31Z)
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment [7.519788903817844]
We propose two Acoustic Feature Mixup strategies to address data scarcity and score-label imbalances.
We integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation.
arXiv Detail & Related papers (2024-06-22T03:56:29Z)
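Mixup is a standard augmentation recipe, so the entry above admits a generic sketch: interpolate two (acoustic-feature, score) pairs with a Beta-sampled weight. The mixing distribution and pairing scheme below are illustrative assumptions, not the paper's exact strategies.

```python
import numpy as np

def acoustic_feature_mixup(x1: np.ndarray, y1: float,
                           x2: np.ndarray, y2: float,
                           alpha: float = 0.2,
                           rng: np.random.Generator | None = None):
    """Convexly combine two (feature, score) pairs; lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```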
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
This paper proposes a speech rhythm-based speaker-embedding method that models phoneme duration from a few utterances by the target speaker.
A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
arXiv Detail & Related papers (2024-02-11T02:26:43Z)
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase their highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- Transfer the linguistic representations from TTS to accent conversion with non-parallel data [7.376032484438044]
Accent conversion aims to convert the accent of a source speech to a target accent, preserving the speaker's identity.
This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech.
arXiv Detail & Related papers (2024-01-07T16:39:34Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
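As a framework-free illustration of the "cross-attention with a trainable set of codebooks" named above, the sketch below lets encoder frames attend over a small matrix of accent codebook entries via scaled dot-product attention; the single head and the residual combination are assumptions, not the paper's exact architecture.

```python
import numpy as np

def codebook_cross_attention(h: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """h: (T, d) encoder states; codebook: (K, d) trainable accent entries."""
    d = h.shape[-1]
    scores = h @ codebook.T / np.sqrt(d)            # (T, K) attention logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stabilisation
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over codebook
    return h + weights @ codebook                   # residual accent bias
```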
- What You Hear Is What You See: Audio Quality Metrics From Image Quality Metrics [44.659718609385315]
We investigate the feasibility of utilizing state-of-the-art image perceptual metrics for evaluating audio signals by representing them as spectrograms.
We customise one of the metrics which has a psychoacoustically plausible architecture to account for the peculiarities of sound signals.
We evaluate the effectiveness of our proposed metric and several baseline metrics using a music dataset.
arXiv Detail & Related papers (2023-05-19T10:43:57Z)
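The recipe in the entry above is easy to sketch: render both signals as log-mel spectrograms and score them with an off-the-shelf image metric. SSIM stands in below for the customised, psychoacoustically motivated metric in the paper, and equal-length inputs at the same sample rate are assumed.

```python
import numpy as np
import librosa
from skimage.metrics import structural_similarity

def spectrogram_image_score(ref: np.ndarray, deg: np.ndarray, sr: int) -> float:
    """Compare two equal-length waveforms as log-mel spectrogram 'images'."""
    def log_mel(y: np.ndarray) -> np.ndarray:
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        return librosa.power_to_db(mel, ref=np.max)
    a, b = log_mel(ref), log_mel(deg)
    data_range = max(a.max() - a.min(), b.max() - b.min())
    return float(structural_similarity(a, b, data_range=data_range))
```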
- Rethinking and Refining the Distinct Metric [61.213465863627476]
We refine the calculation of distinct scores by re-scaling the number of distinct tokens based on its expectation.
We provide both empirical and theoretical evidence to show that our method effectively removes the biases exhibited in the original distinct score.
arXiv Detail & Related papers (2022-02-28T07:36:30Z)
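From the abstract alone, the refinement above can be sketched as dividing the distinct-token count by its expectation under uniform sampling from a vocabulary of size V; treat the exact normaliser below as an assumption inferred from that description rather than the paper's verbatim formula.

```python
def expectation_adjusted_distinct(tokens: list[str], vocab_size: int) -> float:
    """Distinct-token count re-scaled by its expectation for a sequence of
    len(tokens) draws from a uniform vocabulary of vocab_size types."""
    c = len(tokens)
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** c)
    return len(set(tokens)) / expected
```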
- Phonetic Word Embeddings [1.2192936362342826]
We present a novel methodology for calculating the phonetic similarity between words, drawing motivation from the human perception of sounds.
This metric is employed to learn a continuous vector embedding space that groups similar sounding words together.
The efficacy of the method is presented for two different languages (English, Hindi), and performance gains over previously reported works are discussed.
arXiv Detail & Related papers (2021-09-30T01:46:01Z)