Towards Disentangled Speech Representations
- URL: http://arxiv.org/abs/2208.13191v1
- Date: Sun, 28 Aug 2022 10:03:55 GMT
- Title: Towards Disentangled Speech Representations
- Authors: Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho
- Abstract summary: We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
- Score: 65.7834494783044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The careful construction of audio representations has become a dominant
feature in the design of approaches to many speech tasks. Increasingly, such
approaches have emphasized "disentanglement", where a representation contains
only parts of the speech signal relevant to transcription while discarding
irrelevant information. In this paper, we construct a representation learning
task based on joint modeling of ASR and TTS, and seek to learn a representation
of audio that disentangles that part of the speech signal that is relevant to
transcription from that part which is not. We present empirical evidence that
successfully finding such a representation is tied to the randomness inherent
in training. We then make the observation that these desired, disentangled
solutions to the optimization problem possess unique statistical properties.
Finally, we show that enforcing these properties during training improves WER
by 24.5% relative on average for our joint modeling task. These observations
motivate a novel approach to learning effective audio representations.
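As a rough illustration of the joint modeling task described in the abstract, the sketch below pairs a shared audio encoder with an ASR head and a TTS-style reconstruction head, so that one learned representation serves both transcription and audio reconstruction. The module choices, dimensions, names, and loss weighting are illustrative assumptions only, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn

# Minimal sketch of a joint ASR/TTS representation-learning setup: a shared
# encoder produces a representation that an ASR head uses for transcription
# and a TTS-style head uses to reconstruct the input audio features.
# All layer types, sizes, and the loss weighting below are hypothetical.

class JointASRTTS(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=1000):
        super().__init__()
        # Shared encoder: maps audio features to the learned representation.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # ASR head: token logits from the representation (e.g. for a CTC loss).
        self.asr_head = nn.Linear(hidden, vocab_size)
        # TTS/reconstruction head: predicts audio features back from the representation.
        self.tts_head = nn.Linear(hidden, n_mels)

    def forward(self, mels):
        rep, _ = self.encoder(mels)            # (batch, time, hidden)
        return self.asr_head(rep), self.tts_head(rep), rep


def joint_loss(asr_loss, recon_loss, alpha=0.5):
    # The joint objective is sketched as a weighted sum of a transcription loss
    # and a reconstruction loss; alpha is a hypothetical trade-off weight.
    return alpha * asr_loss + (1 - alpha) * recon_loss
```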
Related papers
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
arXiv Detail & Related papers (2024-06-08T18:34:28Z)
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements.
arXiv Detail & Related papers (2024-04-19T09:08:44Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning [25.743503223389784]
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z)
- Learning De-identified Representations of Prosody from Raw Audio [7.025418443146435]
We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal.
We exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations.
arXiv Detail & Related papers (2021-07-17T14:37:25Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)