uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models
- URL: http://arxiv.org/abs/2310.00900v1
- Date: Mon, 2 Oct 2023 04:36:39 GMT
- Title: uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models
- Authors: Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang,
Bhiksha Raj, Dong Yu
- Abstract summary: We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
- Score: 57.71199494492223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement aims to improve speech signals in terms of
quality and intelligibility, and speech editing refers to the process of
editing the speech according to specific user needs. In this paper, we propose
a Unified Speech Enhancement and Editing (uSee) model with conditional
diffusion models to handle various tasks at the same time in a generative
manner. Specifically, by providing multiple types of conditions including
self-supervised learning embeddings and proper text prompts to the score-based
diffusion model, we can enable controllable generation of the unified speech
enhancement and editing model to perform corresponding actions on the source
speech. Our experiments show that our proposed uSee model can achieve superior
performance in both speech denoising and dereverberation compared to other
related generative speech enhancement models, and can perform speech editing
given desired environmental sound text description, signal-to-noise ratios
(SNR), and room impulse responses (RIR). Demos of the generated speech are
available at https://muqiaoy.github.io/usee.
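The abstract describes steering a score-based diffusion model with condition embeddings such as self-supervised learning features and text prompts. A minimal toy sketch of that idea follows, assuming a hypothetical stand-in score network and a simple VE-style reverse SDE; this is illustrative only, not the authors' architecture, and every function name here is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score_net(x_t, t, cond):
    """Hypothetical stand-in for the conditional score network. In uSee
    this would be a deep model taking the noisy representation plus SSL /
    text-prompt embeddings; this toy simply pulls x_t toward `cond`."""
    return (cond - x_t) / max(t, 1e-3)

def reverse_sde_step(x_t, t, dt, cond, sigma=1.0):
    """One Euler-Maruyama step of the reverse-time SDE
    dx = [f - g^2 * score] dt + g dW, with f = 0 and g = sigma."""
    score = toy_score_net(x_t, t, cond)
    drift = -(sigma ** 2) * score              # reverse drift (f = 0)
    z = rng.standard_normal(x_t.shape)
    return x_t - drift * dt + sigma * np.sqrt(dt) * z

# Sampling: start from noise and integrate t from 1 down toward 0,
# steering the trajectory with the condition embedding.
cond = np.full(4, 0.5)                         # e.g. a pooled SSL embedding
x = rng.standard_normal(4)
for t in np.linspace(1.0, 0.01, 50):
    x = reverse_sde_step(x, t, dt=1.0 / 50, cond=cond)
```

The conditioning enters only through the score network's inputs; swapping the condition embedding (say, a different environmental-sound text prompt) changes the trajectory without retraining the sampler.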
Related papers
- Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
arXiv Detail & Related papers (2024-10-07T18:39:59Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
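The conditional diffusion idea summarized above, where characteristics of the observed noisy speech enter the forward process itself, can be sketched roughly as a forward step whose mean interpolates between clean and noisy signals. The interpolation weight `m_t` and the simplified variance below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def conditional_diffuse(x0, y, alpha_bar_t, m_t):
    """Sample x_t from a conditional forward process: the mean
    interpolates between clean speech x0 and observed noisy speech y,
    so the noisy signal shapes the diffusion trajectory. `m_t` ramps
    from 0 (pure clean) to 1 (pure noisy) as t increases."""
    mean = np.sqrt(alpha_bar_t) * ((1.0 - m_t) * x0 + m_t * y)
    var = 1.0 - alpha_bar_t        # simplified; the paper derives its own variance
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)

x0 = np.zeros(8)                           # toy "clean" frame
y = x0 + 0.3 * rng.standard_normal(8)      # toy "noisy" observation
x_mid = conditional_diffuse(x0, y, alpha_bar_t=0.5, m_t=0.5)
```

At `alpha_bar_t = 1, m_t = 0` this reduces to the clean signal exactly, and the reverse process learned on such samples denoises toward `x0` rather than toward an unconditional prior.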
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.