SpeechLMScore: Evaluating speech generation using speech language model
- URL: http://arxiv.org/abs/2212.04559v1
- Date: Thu, 8 Dec 2022 21:00:15 GMT
- Title: SpeechLMScore: Evaluating speech generation using speech language model
- Authors: Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe
- Abstract summary: We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model.
It does not require human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
- Score: 43.20067175503602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While human evaluation is the most reliable metric for evaluating speech
generation systems, it is generally costly and time-consuming. Previous studies
on automatic speech quality assessment address the problem by predicting human
evaluation scores with machine learning models. However, they rely on
supervised learning and thus suffer from high annotation costs and domain-shift
problems. We propose SpeechLMScore, an unsupervised metric to evaluate
generated speech using a speech-language model. SpeechLMScore maps a speech
signal into a sequence of discrete tokens and computes the average
log-probability of generating that token sequence under the language model.
Therefore, it does not require human annotation and is a highly scalable
framework. Evaluation results demonstrate that the proposed metric shows a
promising correlation with human evaluation scores on different speech
generation tasks including voice conversion, text-to-speech, and speech
enhancement.
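In effect, the metric is the average per-token log-likelihood of the discretized utterance under a unit language model. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes the speech has already been converted to discrete unit IDs (e.g., by clustering self-supervised features) and that `lm` is any autoregressive model returning next-token logits over that unit vocabulary.
```python
# Minimal sketch of the SpeechLMScore idea (illustrative, not the paper's code).
# Assumptions: `tokens` holds discrete speech-unit IDs for one utterance, and
# `lm` is an autoregressive LM mapping a (batch, length) LongTensor of unit IDs
# to (batch, length, vocab_size) next-token logits.

import torch
import torch.nn.functional as F

def speechlm_score(lm, tokens: torch.Tensor) -> float:
    """Average log-probability (1/T) * sum_t log p(d_t | d_<t) of a
    discrete speech-token sequence `tokens` of shape (T,)."""
    inputs, targets = tokens[:-1], tokens[1:]
    logits = lm(inputs.unsqueeze(0)).squeeze(0)          # (T-1, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability the LM assigns to each actually observed next token.
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()
```
Under this formulation, a higher (less negative) score means the unit language model finds the token sequence more likely, which the paper reports correlates with human quality judgments.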
Related papers
- A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response.
We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Investigating model performance in language identification: beyond simple error statistics [28.128924654154087]
Language development experts need tools that can automatically identify languages from fluent, conversational speech.
We investigate how well a number of language identification systems perform on individual recordings and speech units with different linguistic properties.
arXiv Detail & Related papers (2023-05-30T10:32:53Z)
- Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text.
These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech.
Compared to an out-of-the-box model, we achieve better performance on objective measures for neutral speech and better time alignment for happy speech.
arXiv Detail & Related papers (2023-01-29T02:58:01Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.
arXiv Detail & Related papers (2022-11-22T08:14:07Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)