Rhythm Modeling for Voice Conversion
- URL: http://arxiv.org/abs/2307.06040v1
- Date: Wed, 12 Jul 2023 09:35:16 GMT
- Title: Rhythm Modeling for Voice Conversion
- Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper
- Abstract summary: We introduce Urhythmic, an unsupervised method for rhythm conversion.
We first divide source audio into segments approximating sonorants, obstruents, and silences.
We then model rhythm by estimating speaking rate or the duration distribution of each segment type.
Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
- Score: 23.995555525421224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice conversion aims to transform source speech into a different target
voice. However, typical voice conversion systems do not account for rhythm,
which is an important factor in the perception of speaker identity. To bridge
this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion
that does not require parallel data or text transcriptions. Using
self-supervised representations, we first divide source audio into segments
approximating sonorants, obstruents, and silences. Then we model rhythm by
estimating speaking rate or the duration distribution of each segment type.
Finally, we match the target speaking rate or rhythm by time-stretching the
speech segments. Experiments show that Urhythmic outperforms existing
unsupervised methods in terms of quality and prosody. Code and checkpoints:
https://github.com/bshall/urhythmic. Audio demo page:
https://ubisoft-laforge.github.io/speech/urhythmic.
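To make the pipeline concrete, the sketch below illustrates the coarse, speaking-rate variant of the idea under stated assumptions: the segmentation into sonorants, obstruents, and silences is taken as given (the paper derives it from self-supervised features), and `time_stretch` is a placeholder for any time-scale modification routine (e.g. WSOLA-style). All names are illustrative, not the authors' API; see the linked repository for the actual implementation.

```python
SONORANT, OBSTRUENT, SILENCE = "sonorant", "obstruent", "silence"

def speaking_rate(segments, total_seconds):
    """Estimate speaking rate as the number of speech segments
    (sonorants and obstruents) per second of audio.

    `segments` is a list of (label, start_sec, end_sec) tuples,
    assumed to come from an upstream segmenter."""
    n_speech = sum(1 for label, _, _ in segments if label != SILENCE)
    return n_speech / total_seconds

def match_speaking_rate(source_wav, sr, segments, target_rate, time_stretch):
    """Global rhythm conversion: uniformly time-stretch the utterance so
    the source speaking rate matches the target speaker's rate.

    `time_stretch(wav, sr, factor)` is a placeholder; factor > 1 is
    assumed to lengthen (slow down) the audio."""
    duration = len(source_wav) / sr
    factor = speaking_rate(segments, duration) / target_rate
    return time_stretch(source_wav, sr, factor)
```

The fine-grained variant described in the abstract would instead fit a duration distribution per segment type and stretch each segment toward the target distribution, rather than applying one global factor.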
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z) - Speech Diarization and ASR with GMM [0.0]
Speech diarization involves the separation of individual speakers within an audio stream.
ASR entails the conversion of an unknown speech waveform into a corresponding written transcription.
Our primary objective is to develop a model that minimizes the Word Error Rate (WER) during speech transcription.
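For reference, WER is the word-level edit distance between hypothesis and reference transcripts, normalized by reference length. A minimal self-contained implementation (standard dynamic programming, not tied to this paper's code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("the cat sat", "the cat sit") -> 0.333...
```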
arXiv Detail & Related papers (2023-07-11T09:25:39Z) - AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
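The zero-shot swap works because CLIP embeds images and text into a shared space, so the source of the query vector can change between training and inference. A minimal sketch assuming the OpenAI `clip` package (the downstream separation network that consumes the query vector is not shown):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def query_vector(text=None, image=None):
    """Build a query vector from a video frame (training time) or a
    text prompt (zero-shot inference); `image` is a PIL image."""
    with torch.no_grad():
        if text is not None:
            return model.encode_text(clip.tokenize([text]).to(device))
        return model.encode_image(preprocess(image).unsqueeze(0).to(device))
```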
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Speech Representation Disentanglement with Adversarial Mutual
Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z) - VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
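To illustrate the VQ step: each content frame is replaced by its nearest codebook entry, an information bottleneck that discards fine-grained (e.g. speaker) detail. A minimal NumPy sketch with illustrative sizes; during training, gradients typically flow through via the straight-through estimator:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each frame embedding in z (T, D) to its nearest codebook
    vector (K, D) under Euclidean distance; returns the quantized
    frames and the chosen code indices."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))  # 64 codes, 16-dim (illustrative)
frames = rng.normal(size=(100, 16))   # 100 content frames
quantized, codes = vector_quantize(frames, codebook)
```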
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model trained for automatic speech recognition, together with extracted melody features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.