Self-supervised speech unit discovery from articulatory and acoustic
features using VQ-VAE
- URL: http://arxiv.org/abs/2206.08790v1
- Date: Fri, 17 Jun 2022 14:04:24 GMT
- Title: Self-supervised speech unit discovery from articulatory and acoustic
features using VQ-VAE
- Authors: Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber
- Abstract summary: This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data.
Experiments were conducted on three different corpora in English and French.
- Score: 2.771610203951056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The human perception system is often assumed to recruit motor knowledge when
processing auditory speech inputs. Using articulatory modeling and deep
learning, this study examines how this articulatory information can be used for
discovering speech units in a self-supervised setting. We used vector-quantized
variational autoencoders (VQ-VAE) to learn discrete representations from
articulatory and acoustic speech data. In line with the zero-resource paradigm,
an ABX test was then used to investigate how the extracted representations
encode phonetically relevant properties. Experiments were conducted on three
different corpora in English and French. We found that articulatory information
organises the latent representations mainly in terms of place of articulation,
whereas the speech acoustics mainly structure the latent space in terms of
manner of articulation. We show that an optimal fusion of the two modalities
can lead to a joint representation of these phonetic dimensions more accurate
than each modality considered individually. Since articulatory information is
usually not available in practical situations, we finally investigate the
benefit it provides when inferred from the speech acoustics in a
self-supervised manner.
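As a concrete illustration of the discrete unit discovery described in the abstract, the sketch below shows the vector-quantization step at the heart of a VQ-VAE: each continuous encoder output frame is replaced by its nearest codebook vector, gradients are passed through with the straight-through estimator, and codebook/commitment losses are returned. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the codebook size, dimensions and commitment cost are illustrative placeholders.

```python
# Minimal sketch of the vector-quantization layer of a VQ-VAE
# (nearest-codebook lookup + straight-through estimator).
# Hyperparameters are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.size(-1))                      # (B*T, D)
        # Squared Euclidean distance from each frame to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))             # (B*T, K)
        codes = dist.argmin(dim=1)                                # discrete unit indices
        z_q = self.codebook(codes).view_as(z_e)
        # Codebook and commitment losses (VQ-VAE objective terms)
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: gradients flow from decoder input to encoder output
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1]), loss
```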
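The phonetic relevance of such representations is assessed with a machine-ABX test, as in the zero-resource paradigm mentioned above: given tokens A and B from two phone categories and a token X from the same category as A, the representation is scored by how often X lies closer to A than to B. The sketch below, assuming NumPy and a simple DTW over cosine frame distances, only illustrates the principle; the official ZeroSpeech evaluation tooling differs in its distance and aggregation details.

```python
# Minimal sketch of a machine-ABX discriminability score over learned representations.
# Each token is a (frames, dim) array of latent vectors for one phone occurrence.
import numpy as np

def cosine_dist(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def dtw_distance(a, b):
    # a: (Ta, D), b: (Tb, D) -> accumulated frame distance along the DTW path, length-normalised
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = cosine_dist(a[i - 1], b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb] / (Ta + Tb)

def abx_error(a_tokens, b_tokens, x_tokens):
    # X tokens belong to the same phone category as A; an error is counted
    # when X is closer to the wrong category B (ties count as half an error).
    errors, total = 0.0, 0
    for x in x_tokens:
        for a in a_tokens:
            for b in b_tokens:
                total += 1
                d_a, d_b = dtw_distance(x, a), dtw_distance(x, b)
                if d_a > d_b:
                    errors += 1.0
                elif d_a == d_b:
                    errors += 0.5
    return errors / max(total, 1)
```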
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Disentangling Prosody Representations with Unsupervised Speech Reconstruction [22.873286925385543]
The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
arXiv Detail & Related papers (2022-12-14T01:37:35Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z)
- Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
- Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study [11.825240267691209]
This paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an acoustic event detection pipeline.
We develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process.
arXiv Detail & Related papers (2021-10-07T04:03:21Z)
- Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.