Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications
- URL: http://arxiv.org/abs/2307.07325v1
- Date: Fri, 14 Jul 2023 13:02:10 GMT
- Title: Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications
- Authors: Varun Krishna, Tarun Sai, Sriram Ganapathy
- Abstract summary: We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations.
- Score: 37.89857769906568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The representation learning of speech, without textual resources, is an area
of significant interest for many low resource speech applications. In this
paper, we describe an approach to self-supervised representation learning from
raw audio using a hidden unit clustering (HUC) framework. The input to the
model consists of audio samples that are windowed and processed with 1-D
convolutional layers. The learned "time-frequency" representations from the
convolutional neural network (CNN) module are further processed with long
short-term memory (LSTM) layers, which generate a contextual vector representation for
every windowed segment. The HUC framework, which categorizes the
representations into a small number of phoneme-like units, is used to train the
model to learn semantically rich speech representations. The targets
consist of phoneme-like pseudo labels for each audio segment and these are
generated with an iterative k-means algorithm. We explore techniques that
improve the speaker invariance of the learned representations and illustrate
the effectiveness of the proposed approach on two settings, i) completely
unsupervised speech applications on the sub-tasks described as part of the
ZeroSpeech 2021 challenge and ii) semi-supervised automatic speech recognition
(ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi
dataset. In these experiments, we achieve state-of-the-art results for various
ZeroSpeech tasks. Further, in the ASR experiments, the HUC representations are
shown to improve significantly over other established benchmarks based on
Wav2vec, HuBERT, and Best-RQ.
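As a rough illustration of the pipeline the abstract describes, the sketch below wires a 1-D convolutional front end and an LSTM to a classifier over k-means pseudo-labels. It is a minimal sketch assuming PyTorch and scikit-learn; the layer sizes, number of clusters, and clustering schedule are placeholder assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a HUC-style pipeline (PyTorch + scikit-learn).
# Layer sizes, the number of clusters, and the clustering schedule are
# illustrative assumptions; the paper's exact configuration is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

N_CLUSTERS = 100  # assumed number of phoneme-like hidden units

class HUCEncoder(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # 1-D convolutions over windowed raw audio ("time-frequency" front end)
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=8, stride=4), nn.ReLU(),
        )
        # LSTM yields a contextual vector for every windowed segment
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # classifier over the phoneme-like pseudo-labels (hidden unit clusters)
        self.head = nn.Linear(hidden, N_CLUSTERS)

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.cnn(wav.unsqueeze(1))       # (batch, hidden, frames)
        x, _ = self.lstm(x.transpose(1, 2))  # (batch, frames, hidden)
        return x, self.head(x)               # contextual reps, cluster logits

model = HUCEncoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
wav = torch.randn(8, 16000)                  # stand-in batch: 8 one-second clips at 16 kHz

# Step 1: generate pseudo-labels by k-means over the current representations.
with torch.no_grad():
    reps, _ = model(wav)
km = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(reps.flatten(0, 1).numpy())
labels = torch.from_numpy(km.labels_).long().view(reps.shape[0], -1)

# Step 2: train the encoder to predict the pseudo-label of every segment.
_, logits = model(wav)
loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
optim.zero_grad()
loss.backward()
optim.step()
```

In the paper's setting the clustering and model training alternate, so the pseudo-labels would be refreshed from the updated representations over several such rounds.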
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves state-of-the-art syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations [18.251845041785906]
We propose a framework to learn semantics from raw audio signals using two types of representations.
We introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions.
For the language model, we adopt a dual-channel architecture to incorporate both types of representation.
arXiv Detail & Related papers (2024-02-02T10:39:58Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR (a generic sketch of this kind of contrastive alignment appears after this list).
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results show that the proposed method learns effective disentangled speech representations (a generic vector-quantization sketch appears after this list).
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
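Two of the entries above (VQ-CTAP and CTAP) learn a joint space for speech and phoneme/text sequences with a contrastive objective. The block below is a hedged, generic sketch of that idea using a standard InfoNCE-style loss over paired frame embeddings; the dimensions and temperature are illustrative assumptions, not the published implementations.

```python
# Generic InfoNCE-style contrastive alignment of paired speech and phoneme
# frame embeddings (illustrative only; not the VQ-CTAP or CTAP code).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, phoneme_emb, temperature=0.07):
    # speech_emb, phoneme_emb: (num_frames, dim); row i of each is a positive pair
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature          # similarity between every speech/phoneme frame
    targets = torch.arange(s.shape[0])        # the matching row is the positive
    # symmetric loss: speech-to-phoneme and phoneme-to-speech retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with random stand-in embeddings from two hypothetical encoders
loss = contrastive_alignment_loss(torch.randn(128, 256), torch.randn(128, 256))
```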
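The VQMIVC entry above relies on vector quantization for content encoding. As a generic illustration (not the VQMIVC implementation), a VQ layer maps each frame embedding to its nearest codebook entry, with a straight-through estimator so gradients can flow through the discrete lookup; the codebook size and dimensions below are arbitrary.

```python
# Generic vector-quantization (codebook lookup) layer with a straight-through
# gradient estimator (illustrative; not the VQMIVC implementation).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x):                          # x: (batch, frames, dim)
        flat = x.reshape(-1, x.shape[-1])          # (batch*frames, dim)
        # distance from every frame to every codebook entry
        dist = torch.cdist(flat, self.codebook.weight)
        codes = dist.argmin(dim=-1)                # nearest codebook index per frame
        quantized = self.codebook(codes).view_as(x)
        # straight-through estimator: copy gradients from quantized back to x
        # (real VQ training also adds codebook/commitment losses, omitted here)
        quantized = x + (quantized - x).detach()
        return quantized, codes.view(x.shape[:-1])

vq = VectorQuantizer()
frames = torch.randn(4, 100, 256, requires_grad=True)
quantized, codes = vq(frames)                      # codes act as discrete content units
quantized.sum().backward()                         # gradients reach `frames` via straight-through
```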