Learning De-identified Representations of Prosody from Raw Audio
- URL: http://arxiv.org/abs/2107.08248v1
- Date: Sat, 17 Jul 2021 14:37:25 GMT
- Title: Learning De-identified Representations of Prosody from Raw Audio
- Authors: Jack Weston, Raphael Lenain, Udeepa Meepegama and Emil Fristed
- Abstract summary: We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal.
We exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations.
- Score: 7.025418443146435
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a method for learning de-identified prosody representations from
raw audio using a contrastive self-supervised signal. Whereas prior work has
relied on conditioning models on bottlenecks, we introduce a set of inductive
biases that exploit the natural structure of prosody to minimize timbral
information and decouple prosody from speaker representations. Despite
aggressive downsampling of the input and having no access to linguistic
information, our model performs comparably to state-of-the-art speech
representations on DAMMP, a new benchmark we introduce for spoken language
understanding. We use minimum description length probing to show that our
representations have selectively learned the subcomponents of non-timbral
prosody, and that the product quantizer naturally disentangles them without
using bottlenecks. We derive an information-theoretic definition of speech
de-identifiability and use it to demonstrate that our prosody representations
are less identifiable than other speech representations.
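To make the contrastive training signal concrete, here is a minimal, hypothetical sketch of a generic InfoNCE-style objective over paired prosody segment embeddings. The paper's actual encoder, sampling scheme, and inductive biases are not reproduced; all names below are illustrative.

```python
# Illustrative sketch only: a generic InfoNCE-style contrastive loss over
# prosody segment embeddings. Positive pairs sit on the diagonal of the
# batch similarity matrix; all other pairs act as negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """anchors, positives: (batch, dim) embeddings of paired segments."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature                     # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)   # diagonal = positives
    return F.cross_entropy(logits, targets)
```

The notion of de-identifiability also admits a standard information-theoretic reading. One natural formalization (an assumption, not necessarily the paper's exact definition) bounds the mutual information between a representation Z and speaker identity S:

```latex
% A representation Z is \epsilon-de-identified with respect to speaker
% identity S when Z leaks at most \epsilon bits about S:
I(Z; S) \;=\; H(S) - H(S \mid Z) \;\le\; \epsilon
```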
Related papers
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling [16.73336092521471]
This paper aims to remove speaker information by exploiting the structured nature of speech.
A neural network predicts unit boundaries in the speech signal, enabling variable-length pooling for event-based representation extraction (a pooling sketch follows this entry).
To confirm that the learned representation captures content information while remaining independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
arXiv Detail & Related papers (2024-04-01T01:49:09Z)
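The variable-length pooling idea can be illustrated with a small, hypothetical sketch. For simplicity this version hard-thresholds the predicted boundary probabilities, whereas the paper describes soft pooling; the function name and shapes are assumptions.

```python
# Hedged sketch: pool frame features into variable-length segments delimited
# by predicted boundaries, yielding one vector per speech event.
import torch

def boundary_pool(frames: torch.Tensor, boundary_prob: torch.Tensor,
                  threshold: float = 0.5) -> torch.Tensor:
    """frames: (T, D) frame-level features; boundary_prob: (T,) in [0, 1].
    Mean-pools the frames between thresholded boundaries."""
    cuts = (boundary_prob > threshold).nonzero(as_tuple=True)[0].tolist()
    edges = [0] + [c + 1 for c in cuts if c + 1 < frames.size(0)] + [frames.size(0)]
    return torch.stack([frames[s:e].mean(dim=0)
                        for s, e in zip(edges, edges[1:]) if e > s])
```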
- Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata; a minimal ABX illustration follows this entry.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that representations extracted from recordings with different linguistic and extra-linguistic characteristics differ along the corresponding dimensions.
arXiv Detail & Related papers (2024-02-08T11:31:23Z)
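As a generic illustration of the ABX paradigm used above (the exact distance measure and trial construction in the paper are assumptions here):

```python
# Minimal ABX discriminability trial. A and X share a category (e.g. the
# same linguistic genre); B does not. The trial succeeds when X lies closer
# to A than to B in embedding space.
import numpy as np

def abx_trial(a: np.ndarray, b: np.ndarray, x: np.ndarray) -> bool:
    cos_dist = lambda u, v: 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return cos_dist(a, x) < cos_dist(b, x)

# Accuracy over many (A, B, X) triples measures how discriminable the tested
# dimension is in the representation; 50% is chance level.
```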
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, degrading the naturalness of converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z)
- Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation [32.75866643254402]
Unsupervised representation learning for speech audio has attained impressive performance on speech recognition tasks.
In this work, we explore learning domain-invariant representations via a direct mapping of speech representations to their corresponding high-level linguistic information.
Results show that the learned latents not only capture the articulatory features of each phoneme but also improve adaptation, outperforming the baseline by a large margin on accented benchmarks.
arXiv Detail & Related papers (2022-10-15T03:56:31Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a minimal VQ sketch follows this entry.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
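As a hedged illustration of the content-encoding step named above, here is a minimal nearest-neighbour vector quantizer; VQMIVC's codebook size, training losses, and MI estimator are not reproduced, and the function name is hypothetical.

```python
# Illustrative nearest-neighbour vector quantization for content encoding.
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (T, D) continuous content features; codebook: (K, D) code vectors.
    Returns the per-frame quantized features and the chosen code indices."""
    dists = torch.cdist(z, codebook)   # (T, K) Euclidean distances
    idx = dists.argmin(dim=-1)         # nearest code per frame
    return codebook[idx], idx
```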
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture that (1) shares low-level features common to both representations and (2) provides a natural mechanism for explicitly disentangling these factors; a minimal sketch follows this entry.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
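A minimal sketch of the two-stream design described above, assuming simple linear layers; the actual architecture, layer sizes, and cross-modal synchrony loss are not specified by the summary, so everything here is illustrative.

```python
# Hedged sketch: a shared low-level trunk feeding two separate heads, one
# per factor to be disentangled (content vs. speaker identity).
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    def __init__(self, in_dim: int = 80, hidden: int = 256, emb: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.content_head = nn.Linear(hidden, emb)   # linguistic-content factor
        self.identity_head = nn.Linear(hidden, emb)  # speaker-identity factor

    def forward(self, x: torch.Tensor):
        h = self.shared(x)
        return self.content_head(h), self.identity_head(h)
```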