AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
- URL: http://arxiv.org/abs/2305.04476v4
- Date: Wed, 24 May 2023 16:37:42 GMT
- Title: AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
- Authors: Ruiqi Li, Rongjie Huang, Lichao Zhang, Jinglin Liu, Zhou Zhao
- Abstract summary: The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
- Score: 67.10208647482109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The speech-to-singing (STS) voice conversion task aims to generate singing
samples corresponding to speech recordings while facing a major challenge: the
alignment between the target (singing) pitch contour and the source (speech)
content is difficult to learn in a text-free situation. This paper proposes
AlignSTS, an STS model based on explicit cross-modal alignment, which views
speech variance such as pitch and content as different modalities. Inspired by
the way humans sing lyrics to a melody, AlignSTS: 1)
adopts a novel rhythm adaptor to predict the target rhythm representation to
bridge the modality gap between content and pitch, where the rhythm
representation is computed in a simple yet effective way and is quantized into
a discrete space; and 2) uses the predicted rhythm representation to re-align
the content based on cross-attention and conducts a cross-modal fusion for
re-synthesis. Extensive experiments show that AlignSTS achieves superior
performance in terms of both objective and subjective metrics. Audio samples
are available at https://alignsts.github.io.
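For a concrete picture, the following is a minimal sketch (in PyTorch) of the re-alignment idea described in the abstract: a discrete rhythm representation queries the speech content via cross-attention, and the re-aligned content is fused with the target pitch for re-synthesis. All module names, dimensions, and the fusion operator here are illustrative assumptions, not the paper's implementation.
```python
# Hypothetical sketch of the re-alignment step described in the abstract:
# a discrete rhythm code queries the speech content via cross-attention,
# and the re-aligned content is fused with pitch for re-synthesis.
# Module names, sizes, and the fusion operator are illustrative assumptions.
import torch
import torch.nn as nn


class RhythmRealignFusion(nn.Module):
    def __init__(self, d_model=256, n_rhythm_codes=64, n_heads=4):
        super().__init__()
        # Discrete rhythm space: predicted rhythm indices are embedded here.
        self.rhythm_emb = nn.Embedding(n_rhythm_codes, d_model)
        # Cross-attention: rhythm (target timing) queries the speech content.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pitch_proj = nn.Linear(1, d_model)  # F0 contour -> feature space
        self.fusion = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))

    def forward(self, rhythm_ids, content, f0):
        # rhythm_ids: (B, T_tgt) discrete rhythm tokens at the target frame rate
        # content:    (B, T_src, d_model) content features from the source speech
        # f0:         (B, T_tgt, 1) target (singing) pitch contour
        q = self.rhythm_emb(rhythm_ids)                    # (B, T_tgt, d_model)
        aligned, _ = self.cross_attn(q, content, content)  # re-aligned content
        fused = self.fusion(torch.cat([aligned, self.pitch_proj(f0)], dim=-1))
        return fused                                       # fed to a decoder/vocoder


if __name__ == "__main__":
    model = RhythmRealignFusion()
    out = model(torch.randint(0, 64, (2, 200)),    # rhythm tokens
                torch.randn(2, 150, 256),          # source content frames
                torch.randn(2, 200, 1))            # target F0
    print(out.shape)  # torch.Size([2, 200, 256])
```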
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
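A rough illustration of the similarity-based hallucination check described above, assuming a shared audio-text embedding space; the encoders and the threshold are placeholders, not the paper's.
```python
# Sketch of hallucination detection via audio-text similarity in a shared
# latent space: a caption whose embedding is far from the audio embedding is
# flagged. Embeddings and the threshold here are placeholders.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

audio_emb = np.random.randn(512)    # stand-in for a shared-space audio embedding
caption_emb = np.random.randn(512)  # stand-in for the candidate caption embedding
THRESHOLD = 0.3                     # assumed decision threshold
print("hallucinated" if cosine(audio_emb, caption_emb) < THRESHOLD else "faithful")
```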
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- Rhythm Modeling for Voice Conversion [23.995555525421224]
We introduce Urhythmic, an unsupervised method for rhythm conversion.
We first divide source audio into segments approximating sonorants, obstruents, and silences.
We then model rhythm by estimating speaking rate or the duration distribution of each segment type.
Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
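A toy sketch of the speaking-rate idea: estimate a rate from segment durations and derive a global time-stretch factor. The segment labels and numbers are made up; this is not Urhythmic's code.
```python
# Illustrative sketch of rhythm modeling via speaking rate: given segments
# labeled as sonorant / obstruent / silence, estimate a rate and derive a
# time-stretch factor. Labels and durations are hypothetical.
def speaking_rate(segments, kind="sonorant"):
    """segments: list of (label, start_sec, end_sec). Rate = segments per second of speech."""
    speech_time = sum(e - s for lab, s, e in segments if lab != "silence")
    count = sum(1 for lab, _, _ in segments if lab == kind)
    return count / max(speech_time, 1e-6)

src = [("sonorant", 0.0, 0.3), ("obstruent", 0.3, 0.4), ("sonorant", 0.4, 0.9),
       ("silence", 0.9, 1.2), ("sonorant", 1.2, 1.5)]
tgt = [("sonorant", 0.0, 0.5), ("sonorant", 0.5, 1.1), ("obstruent", 1.1, 1.3)]

# Stretch the source so its sonorant rate matches the target's.
stretch = speaking_rate(src) / speaking_rate(tgt)
print(f"time-stretch factor: {stretch:.2f}")
```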
arXiv Detail & Related papers (2023-07-12T09:35:16Z)
- Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
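A generic sketch of a CLUB-style variational upper bound on the mutual information between two speech embeddings, using a Gaussian q(y|x); the architecture and dimensions are assumptions rather than the paper's setup.
```python
# Generic sketch of a contrastive log-ratio (CLUB-style) upper bound on the
# mutual information between two embeddings x and y, with a Gaussian
# variational network q(y|x). Not the paper's implementation.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    def __init__(self, x_dim=128, y_dim=128, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_prob(self, x, y):
        # log N(y; mu(x), diag(exp(logvar(x)))), dropping the constant term
        mu, logvar = self.mu(x), self.logvar(x)
        return -0.5 * (((y - mu) ** 2) / logvar.exp() + logvar).sum(-1)

    def mi_upper_bound(self, x, y):
        # positive pairs (x_i, y_i) minus shuffled negative pairs (x_i, y_j)
        pos = self.log_prob(x, y)
        neg = self.log_prob(x, y[torch.randperm(y.size(0))])
        return (pos - neg).mean()


x, y = torch.randn(8, 128), torch.randn(8, 128)
print(CLUBEstimator().mi_upper_bound(x, y).item())  # minimized during training
```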
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
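An illustrative sketch of the discrete-unit idea: cluster stand-in self-supervised speech features with k-means and use the deduplicated cluster IDs as prediction targets. The feature source, dimensionality, and cluster count are assumptions, not the paper's exact setup.
```python
# Illustrative sketch of "discrete units": target-speech features (stand-ins
# for self-supervised representations) are clustered with k-means, and the
# cluster IDs become the sequence the model is trained to predict.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 64))          # stand-in speech features (reduced dims)
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)

utterance = rng.normal(size=(240, 64))       # frames of one target utterance
units = km.predict(utterance)                # discrete unit sequence
# Collapse consecutive repeats, as is common when using units as targets.
dedup = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(dedup[:20])
```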
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
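A minimal sketch of vector-quantized content encoding with a straight-through gradient; the codebook size and feature dimensions are assumptions, not VQMIVC's configuration.
```python
# Minimal sketch of vector-quantized content encoding: each content frame is
# snapped to its nearest codebook entry, with a straight-through gradient so
# the encoder stays trainable. Codebook size and dimensions are assumptions.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))                   # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)     # (B*T, n_codes)
        idx = dist.argmin(-1).view(z.shape[:-1])           # (B, T) code indices
        q = self.codebook(idx)                             # quantized content
        q = z + (q - z).detach()                           # straight-through gradient
        return q, idx


vq = VectorQuantizer()
quantized, codes = vq(torch.randn(2, 100, 64))
print(quantized.shape, codes.shape)  # torch.Size([2, 100, 64]) torch.Size([2, 100])
```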
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.