Learning Speech Representation From Contrastive Token-Acoustic
Pretraining
- URL: http://arxiv.org/abs/2309.00424v5
- Date: Mon, 18 Dec 2023 12:49:49 GMT
- Title: Learning Speech Representation From Contrastive Token-Acoustic
Pretraining
- Authors: Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang,
Jianwu Dang
- Abstract summary: We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
- Score: 57.08426714676043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For fine-grained generation and recognition tasks such as
minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
speech recognition (ASR), the intermediate representations extracted from
speech should serve as a "bridge" between text and acoustic information,
containing information from both modalities. The semantic content is
emphasized, while the paralinguistic information such as speaker identity and
acoustic details should be de-emphasized. However, existing methods for
extracting fine-grained intermediate representations from speech suffer from
issues of excessive redundancy and dimension explosion. Contrastive learning is
well suited to modeling intermediate representations across two modalities.
However, existing contrastive learning methods in the audio field focus on
extracting global descriptive information for downstream audio classification
tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
issues, we propose a method named "Contrastive Token-Acoustic Pretraining
(CTAP)", which uses two encoders to bring phoneme and speech into a joint
multimodal space, learning how to connect phoneme and speech at the frame
level. The CTAP model is trained on 210k speech and phoneme pairs, achieving
minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a
promising solution for fine-grained generation and recognition downstream tasks
in speech processing. We provide a website with audio samples.
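
To make the frame-level contrastive objective concrete, below is a minimal sketch in PyTorch. It is not the authors' implementation: the toy encoders, embedding dimension, temperature, and the assumption that the phoneme and speech sequences are already time-aligned frame by frame are illustrative choices, and the negatives are simply every other frame in the batch.

```python
# Minimal sketch of frame-level contrastive (CTAP-style) pretraining.
# Encoder architectures, dimensions, and temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Toy encoder: maps (batch, frames, in_dim) to unit-norm (batch, frames, emb_dim)."""

    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)


def frame_contrastive_loss(phone_emb, speech_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over time: frame t of the phoneme sequence should match
    frame t of the time-aligned speech sequence, and vice versa."""
    b, t, d = phone_emb.shape
    p = phone_emb.reshape(b * t, d)
    s = speech_emb.reshape(b * t, d)
    logits = p @ s.t() / temperature              # (b*t, b*t) similarity matrix
    targets = torch.arange(b * t, device=logits.device)
    loss_p2s = F.cross_entropy(logits, targets)   # phoneme -> speech direction
    loss_s2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2s + loss_s2p)


if __name__ == "__main__":
    phone_encoder = FrameEncoder(in_dim=64)       # e.g. per-frame phoneme embeddings
    speech_encoder = FrameEncoder(in_dim=80)      # e.g. mel-spectrogram frames
    phones = torch.randn(4, 100, 64)              # (batch, frames, features)
    mels = torch.randn(4, 100, 80)
    loss = frame_contrastive_loss(phone_encoder(phones), speech_encoder(mels))
    loss.backward()
    print(f"contrastive loss: {loss.item():.3f}")
```

In practice, how negatives are drawn (other frames of the same utterance versus other utterances) and how phoneme and speech frames are aligned are key design decisions; the sketch above fixes one simple option for each.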
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a minimal VQ sketch follows this list.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
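
Several of the entries above (VQ-CTAP, VQMIVC) discretize the frame-level content representation with a vector-quantization codebook. The sketch below illustrates only that general idea; the codebook size, embedding dimension, and commitment weight are assumptions and do not come from either paper.

```python
# Minimal sketch of vector-quantized content encoding with a straight-through estimator.
# Codebook size, dimension, and commitment weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup over frame-level representations."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous frame-level representations
        flat = z.reshape(-1, z.shape[-1])                  # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)     # (B*T, num_codes)
        indices = dist.argmin(dim=-1)                      # discrete token ids
        quantized = self.codebook(indices).view_as(z)
        # Codebook loss pulls codes toward encoder outputs; commitment loss does the reverse.
        vq_loss = F.mse_loss(quantized, z.detach()) + self.beta * F.mse_loss(z, quantized.detach())
        # Straight-through: gradients flow to z as if quantization were the identity.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1]), vq_loss


if __name__ == "__main__":
    vq = VectorQuantizer()
    frames = torch.randn(2, 50, 256, requires_grad=True)   # continuous content features
    quantized, token_ids, loss = vq(frames)
    loss.backward()
    print(quantized.shape, token_ids.shape, float(loss))
```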