An Information-Theoretic Analysis of Self-supervised Discrete
Representations of Speech
- URL: http://arxiv.org/abs/2306.02405v1
- Date: Sun, 4 Jun 2023 16:52:11 GMT
- Title: An Information-Theoretic Analysis of Self-supervised Discrete
Representations of Speech
- Authors: Badr M. Abdullah, Mohammed Maqsood Shaik, Bernd Möbius, Dietrich
Klakow
- Abstract summary: We develop an information-theoretic framework whereby we represent each phonetic category as a distribution over discrete units.
Our study demonstrates that the entropy of phonetic distributions reflects the variability of the underlying speech sounds.
While our study confirms the lack of direct, one-to-one correspondence, we find an intriguing, indirect relationship between phonetic categories and discrete units.
- Score: 17.07957283733822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised representation learning for speech often involves a
quantization step that transforms the acoustic input into discrete units.
However, it remains unclear how to characterize the relationship between these
discrete units and abstract phonetic categories such as phonemes. In this
paper, we develop an information-theoretic framework whereby we represent each
phonetic category as a distribution over discrete units. We then apply our
framework to two different self-supervised models (namely wav2vec 2.0 and XLSR)
and use American English speech as a case study. Our study demonstrates that
the entropy of phonetic distributions reflects the variability of the
underlying speech sounds, with phonetically similar sounds exhibiting similar
distributions. While our study confirms the lack of direct, one-to-one
correspondence, we find an intriguing, indirect relationship between phonetic
categories and discrete units.
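As a concrete illustration of the framework described in the abstract, the sketch below estimates P(unit | phone) from frame-level (phone, unit) pairs and computes the entropy of each phonetic distribution. A minimal sketch, assuming a pre-computed alignment between phone labels and the model's discrete unit ids; the function names and toy data are illustrative, not the paper's actual pipeline.

```python
# A minimal sketch of the distribution-over-units view described above.
# The toy alignment data and names are illustrative, not from the paper.
import math
from collections import Counter, defaultdict

def phone_unit_distributions(alignment):
    """Estimate P(unit | phone) from (phone, unit_id) pairs obtained by
    aligning phone labels with the model's discrete unit sequence."""
    counts = defaultdict(Counter)
    for phone, unit in alignment:
        counts[phone][unit] += 1
    return {
        phone: {u: c / sum(cnt.values()) for u, c in cnt.items()}
        for phone, cnt in counts.items()
    }

def entropy(dist):
    """Shannon entropy (in bits) of a distribution over discrete units."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy frame-level alignment: a phone whose frames scatter across more units
# yields a flatter distribution and therefore a higher entropy.
alignment = [("s", 3), ("s", 3), ("s", 7), ("ae", 1), ("ae", 5), ("ae", 9)]
dists = phone_unit_distributions(alignment)
for phone, dist in sorted(dists.items()):
    print(phone, round(entropy(dist), 3))  # ae 1.585, s 0.918
```

Under the paper's finding, phonetically variable categories spread their probability mass over more discrete units and therefore score higher entropy.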
Related papers
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised model that regresses features onto syllabic segments distilled from a teacher model, an exponential moving average (EMA) of the model being trained (see the EMA sketch after this entry).
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
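The EMA teacher mentioned above admits a compact sketch. This assumes a PyTorch setup; the stand-in model, decay value, and function name are illustrative, not taken from the paper.

```python
# A minimal sketch of an EMA teacher for self-distillation: the teacher's
# parameters track a slow exponential moving average of the student's.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

student = torch.nn.Linear(16, 16)   # stand-in for the speech encoder
teacher = copy.deepcopy(student)    # teacher initialized as a copy of the student
ema_update(teacher, student)        # called once per training step
```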
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Methods such as de-duplication and subword modeling can further shorten the discrete unit sequence (see the de-duplication sketch after this entry).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
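The de-duplication step mentioned above reduces to collapsing runs of identical unit ids. A minimal sketch; subword modeling (e.g. a BPE pass over the de-duplicated ids) would be a separate step.

```python
# Collapse consecutive repeats of discrete units to shorten the sequence.
# itertools.groupby yields one group per run of identical values.
from itertools import groupby

def deduplicate(units):
    """[5, 5, 5, 2, 2, 9, 9, 5] -> [5, 2, 9, 5]"""
    return [unit for unit, _ in groupby(units)]

print(deduplicate([5, 5, 5, 2, 2, 9, 9, 5]))  # [5, 2, 9, 5]
```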
- An analysis on the effects of speaker embedding choice in non auto-regressive TTS [4.619541348328938]
We present a first attempt at understanding how a non-autoregressive, factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets.
We show that, regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well.
arXiv Detail & Related papers (2023-07-19T10:57:54Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE [2.771610203951056]
This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data (a minimal quantization sketch follows this entry).
Experiments were conducted on three different corpora in English and French.
arXiv Detail & Related papers (2022-06-17T14:04:24Z)
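For reference, the core quantization step of a VQ-VAE can be sketched as a nearest-codebook lookup. The codebook size, feature dimension, and names below are assumptions; the training-time pieces (straight-through estimator, codebook and commitment losses) are omitted.

```python
# Nearest-codebook quantization, the discretization step inside a VQ-VAE.
# Shapes and sizes are illustrative, not taken from the paper.
import torch

def quantize(frames, codebook):
    """Map each frame in (T, D) to the id and embedding of its nearest code in (K, D)."""
    dists = torch.cdist(frames, codebook)  # (T, K) pairwise Euclidean distances
    ids = dists.argmin(dim=1)              # (T,) discrete unit ids
    return ids, codebook[ids]              # ids and quantized frames

codebook = torch.randn(256, 64)    # K=256 codes of dimension D=64
frames = torch.randn(100, 64)      # 100 encoder output frames
ids, quantized = quantize(frames, codebook)
print(ids.shape, quantized.shape)  # torch.Size([100]) torch.Size([100, 64])
```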
- Autoregressive Co-Training for Learning Discrete Speech Representations [19.400428010647573]
We consider a generative model with discrete latent variables that learns a discrete representation for speech.
We find that the proposed approach learns discrete representations that are highly correlated with phonetic units.
arXiv Detail & Related papers (2022-03-29T18:17:18Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a toy MI diagnostic follows this entry).
Experimental results show the proposed method is superior at learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
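The MI-as-correlation idea above can be illustrated with a toy offline diagnostic: if content codes are well disentangled from speaker identity, the mutual information between the two should be near zero. This uses scikit-learn's mutual_info_score on random stand-in data; it is a post-hoc check, not the MI estimator VQMIVC uses during training.

```python
# Toy disentanglement diagnostic: MI between discrete content codes and
# speaker labels should be low if content carries no speaker information.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
content_codes = rng.integers(0, 64, size=1000)  # VQ unit id per frame (stand-in)
speakers = rng.integers(0, 10, size=1000)       # speaker label per frame (stand-in)

# Near zero for independent variables; larger if codes leak speaker identity.
print(mutual_info_score(speakers, content_codes))
```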
- Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study [12.210797811981173]
In this paper, we ask: does distance in the acoustic embedding space correlate with phonological dissimilarity? (A toy version of this check follows this entry.)
We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity.
Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity.
arXiv Detail & Related papers (2021-06-16T10:47:56Z)
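A toy version of that correlation check: compare cosine distances between word embeddings with Levenshtein distances over phone sequences and report Spearman's rho. The words, random embeddings, and distance choices below are stand-ins for the paper's controlled setup.

```python
# Does distance in embedding space track phonological (edit) distance?
# Toy stand-in data; real AWEs would replace the random vectors.
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Plain Levenshtein distance over phone sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

words = {"cat": ["k", "ae", "t"], "bat": ["b", "ae", "t"], "dog": ["d", "ao", "g"]}
embeddings = {w: np.random.randn(32) for w in words}  # toy AWE vectors

pairs = list(combinations(words, 2))
emb_dists = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
pho_dists = [edit_distance(words[a], words[b]) for a, b in pairs]
rho, _ = spearmanr(emb_dists, pho_dists)
print(f"Spearman rho: {rho:.3f}")
```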
- Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
Among the findings, compositional representations recruit a more widespread cortical network than lexical ones, encompassing the bilateral temporal, parietal, and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors (a rough sketch follows this entry).
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
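As a rough illustration of that two-stream design, the sketch below shares a low-level trunk between an identity head and a content head. All layer sizes and names are invented for illustration, and the cross-modal synchrony objective is omitted.

```python
# Rough illustration of a two-stream design: a shared low-level trunk feeds
# two heads, giving the model an explicit place to separate speaker identity
# from linguistic content. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    def __init__(self, n_mels=80, hidden=256, emb_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(  # low-level features shared by both streams
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.identity_head = nn.Linear(hidden, emb_dim)  # speaker factors
        self.content_head = nn.Linear(hidden, emb_dim)   # linguistic factors

    def forward(self, mel):                  # mel: (batch, n_mels, time)
        h = self.trunk(mel).mean(dim=2)      # pool over time -> (batch, hidden)
        return self.identity_head(h), self.content_head(h)

model = TwoStream()
identity, content = model(torch.randn(4, 80, 100))
print(identity.shape, content.shape)  # torch.Size([4, 128]) twice
```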