Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2407.18541v1
- Date: Fri, 26 Jul 2024 06:44:01 GMT
- Title: Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
- Authors: Neil Shah, Shirish Karande, Vineet Gandhi
- Abstract summary: We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task.
Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis.
Our method surpasses the current state-of-the-art (SOTA) with a 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric.
- Score: 24.943609458024596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite utilizing simulated speech, our method surpasses the current state-of-the-art (SOTA) with a 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric. Additionally, we present error rates and demonstrate our model's proficiency in synthesizing speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark with a Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
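Since MCD is the headline metric, a concrete definition helps: it is the frame-averaged, scaled Euclidean distance between the mel-cepstral coefficient vectors of reference and synthesized speech, conventionally excluding the energy coefficient. The sketch below is a minimal numpy version under the assumption that frames are already time-aligned (the paper's exact extraction and alignment pipeline may differ); the toy data is illustrative only.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    mcep_ref, mcep_syn: (frames, coeffs) arrays; coefficient 0 (energy)
    is excluded by convention. Alignment (e.g., DTW) is assumed done.
    """
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)  # standard MCD scaling
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))

# Toy usage: 200 frames of 25-dimensional mel-cepstra.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 25))
syn = ref + 0.1 * rng.normal(size=(200, 25))
print(f"MCD: {mel_cepstral_distortion(ref, syn):.2f} dB")
```

Lower MCD is better, so the 29.08% figure above reads as a relative reduction of this distance against the previous SOTA.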
Related papers
- DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization [12.310318928818546]
We propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization.
We show DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity.
This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences.
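Schematically, the recipe combines a distillation term with a differentiable metric term. The loss below is a hedged illustration, not DMDSpeech's actual objective: the weighting `lam` and the cosine speaker-similarity proxy are assumptions.

```python
import numpy as np

def speaker_similarity(e_gen: np.ndarray, e_ref: np.ndarray) -> float:
    """Cosine similarity between generated and reference speaker
    embeddings (the real model would use a differentiable speaker encoder)."""
    return float(e_gen @ e_ref / (np.linalg.norm(e_gen) * np.linalg.norm(e_ref)))

def distill_with_metric_loss(mel_student: np.ndarray, mel_teacher: np.ndarray,
                             e_gen: np.ndarray, e_ref: np.ndarray,
                             lam: float = 0.3) -> float:
    """Distillation term plus a direct metric term (hypothetical weighting)."""
    distill = float(np.mean((mel_student - mel_teacher) ** 2))  # match the teacher
    metric = 1.0 - speaker_similarity(e_gen, e_ref)             # optimize the metric itself
    return distill + lam * metric

rng = np.random.default_rng(4)
print(distill_with_metric_loss(rng.normal(size=(80, 100)), rng.normal(size=(80, 100)),
                               rng.normal(size=192), rng.normal(size=192)))
```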
arXiv Detail & Related papers (2024-10-14T21:17:58Z)
- Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech [34.8899247119748]
We propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech.
Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet.
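The protocol boils down to transcribing real speech with an ASR model trained on synthetic speech and scoring the transcripts by word error rate, which is just word-level edit distance. Below is a minimal pure-Python sketch of the metric itself; in the actual evaluation the hypothesis would come from the trained recognizer, and the strings here are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```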
arXiv Detail & Related papers (2023-10-01T15:52:48Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text.
These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech.
We achieved better performance on objective measures for neutral speech, and better time alignment for happy speech, compared to an out-of-the-box model.
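A hedged sketch of how such relative word lengths might be consumed downstream: scale a neutral per-word duration by an emotion-conditioned ratio and hand the result to a duration-controllable TTS. The function name and all values are hypothetical stand-ins for the GAN's outputs, not the paper's interface.

```python
from typing import Dict, List

def apply_emotive_durations(words: List[str],
                            neutral_durations: List[float],
                            length_ratios: Dict[str, float]) -> List[float]:
    """Scale neutral word durations by emotion-conditioned ratios.

    length_ratios stands in for the GAN's per-word output: a multiplier
    relative to neutral speech (hypothetical values below).
    """
    return [dur * length_ratios.get(w, 1.0)
            for w, dur in zip(words, neutral_durations)]

words = ["I", "am", "so", "happy"]
neutral = [0.12, 0.10, 0.15, 0.30]   # seconds, from a neutral TTS pass
ratios = {"so": 1.4, "happy": 1.2}   # hypothetical GAN predictions
print(apply_emotive_durations(words, neutral, ratios))
```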
arXiv Detail & Related papers (2023-01-29T02:58:01Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to the target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations [3.0016140723286457]
We formulated non-speech vocalization (NSV) modeling as a text-to-speech task and verified its viability.
Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to control speaker timbre.
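A minimal sketch of the discrete-unit step such pipelines depend on: frame-level self-supervised features are assigned to k-means centroids, and consecutive repeats are collapsed into a unit sequence. The random arrays stand in for real HuBERT features and a trained codebook, so this shows the mechanics only.

```python
import numpy as np

def features_to_units(features: np.ndarray, centroids: np.ndarray) -> list:
    """Quantize frame-level SSL features to discrete unit IDs and
    collapse consecutive repeats, as in HuBERT-unit pipelines.

    features: (frames, dim) activations, e.g., from a HuBERT layer.
    centroids: (k, dim) k-means codebook trained offline.
    """
    # Nearest-centroid assignment per frame.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    units = dists.argmin(axis=1)
    # Deduplicate runs: [5, 5, 5, 2, 2, 9] -> [5, 2, 9]
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 768))      # stand-in for HuBERT features
codebook = rng.normal(size=(100, 768))  # stand-in for a 100-unit codebook
print(features_to_units(feats, codebook)[:10])
```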
arXiv Detail & Related papers (2022-06-25T14:27:10Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
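The masks-and-predicts loop admits a generic sketch: start fully masked, then on each round keep the most confident predictions and re-mask the rest. This is schematic mask-predict decoding, not TranSpeech's exact schedule, and the toy predictor is a stand-in for the model.

```python
import numpy as np

def mask_predict_decode(predict, length: int, iterations: int = 4,
                        mask_id: int = -1) -> np.ndarray:
    """Generic iterative mask-predict decoding.

    predict(tokens) -> (confidence, ids): per-position confidence scores
    and predicted unit IDs. Each round keeps the most confident positions
    and re-masks the rest for re-prediction.
    """
    tokens = np.full(length, mask_id)
    for t in range(iterations):
        confidence, ids = predict(tokens)
        tokens = ids.copy()
        n_mask = length * (iterations - 1 - t) // iterations
        if n_mask > 0:
            tokens[np.argsort(confidence)[:n_mask]] = mask_id  # least confident
    return tokens

# Toy stand-in for the model: random confidences and unit IDs.
rng = np.random.default_rng(2)
toy_predict = lambda toks: (rng.random(len(toks)), rng.integers(0, 100, len(toks)))
print(mask_predict_decode(toy_predict, length=12))
```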
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM over linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
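A minimal PyTorch sketch of a generative LM over sub-word linguistic units, trained by next-unit prediction with an LSTM; vocabulary size, depth, and widths are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    """Minimal LSTM language model over sub-word unit IDs
    (phonemes or syllables); a sketch, not the paper's exact model."""
    def __init__(self, vocab_size: int = 64, embed: int = 128, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embedding(tokens))
        return self.head(out)  # next-unit logits at every position

# Next-unit prediction on a toy batch of phoneme ID sequences.
model = PhonemeLM()
batch = torch.randint(0, 64, (4, 20))  # (batch, time)
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 64), batch[:, 1:].reshape(-1))
print(loss.item())
```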
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
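The switching idea can be sketched on top of a standard InfoNCE-style contrastive loss: context vectors from each view are scored against the quantized targets of the other view as well as their own, pushing the network toward noise-invariant representations. Plain numpy with random tensors below; the real method operates inside the wav2vec 2.0 training graph.

```python
import numpy as np

def contrastive_loss(context: np.ndarray, targets: np.ndarray,
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss: each context vector must pick out its own
    target among all targets via cosine similarity."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = c @ q.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Sketch of the switched-target idea: contexts from the noisy view are
# also trained against quantized targets from the original view, and vice versa.
rng = np.random.default_rng(3)
ctx_orig, ctx_noisy = rng.normal(size=(2, 32, 256))
q_orig, q_noisy = rng.normal(size=(2, 32, 256))
loss = (contrastive_loss(ctx_orig, q_orig) + contrastive_loss(ctx_noisy, q_noisy)    # standard
        + contrastive_loss(ctx_orig, q_noisy) + contrastive_loss(ctx_noisy, q_orig)) # switched
print(loss)
```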
arXiv Detail & Related papers (2021-10-11T00:08:48Z)