Analysing Discrete Self Supervised Speech Representation for Spoken
Language Modeling
- URL: http://arxiv.org/abs/2301.00591v1
- Date: Mon, 2 Jan 2023 10:36:40 GMT
- Title: Analysing Discrete Self Supervised Speech Representation for Spoken
Language Modeling
- Authors: Amitay Sicherman, Yossi Adi
- Abstract summary: This work profoundly analyzes discrete self-supervised speech representations through the eyes of Generative Spoken Language Modeling.
We propose practical improvements to the discrete unit for the GSLM.
- Score: 21.19785690690611
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work profoundly analyzes discrete self-supervised speech representations
through the eyes of Generative Spoken Language Modeling (GSLM). Following the
findings of such an analysis, we propose practical improvements to the discrete
unit for the GSLM. First, we start comprehending these units by analyzing them
along three axes: interpretation, visualization, and resynthesis. Our analysis
finds a high correlation between the speech units and phonemes and phoneme
families, while their correlation with speaker or gender is weaker.
Additionally, we found redundancies in the extracted units and claim that one
reason may be the units' context. Following this analysis, we propose a new,
unsupervised metric to measure unit redundancies. Finally, we use this metric
to develop new methods that improve the robustness of units clustering and show
significant improvement considering zero-resource speech metrics such as ABX.
Code and analysis tools are available under the following link.
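The abstract's claim that units track phonemes more closely than speakers can be illustrated with a simple purity score over frame-aligned labels. The following is a minimal sketch and not the authors' analysis code: the function name `unit_purity`, the toy unit IDs, and the phoneme and speaker labels are all hypothetical, and purity is only one of several possible ways to quantify such a correlation.

```python
from collections import Counter, defaultdict


def unit_purity(units, labels):
    """Share of frames whose label is the most frequent label of their unit.

    Hypothetical helper for illustration: `units` are discrete unit IDs and
    `labels` are frame-aligned annotations (phonemes, speakers, ...).
    """
    if len(units) != len(labels):
        raise ValueError("units and labels must be frame-aligned")
    if not units:
        raise ValueError("at least one frame is required")

    # Count how often each label co-occurs with each unit.
    counts_by_unit = defaultdict(Counter)
    for unit, label in zip(units, labels):
        counts_by_unit[unit][label] += 1

    # For every unit, keep only the frames that carry its majority label.
    majority_frames = sum(
        counts.most_common(1)[0][1] for counts in counts_by_unit.values()
    )
    return majority_frames / len(units)


# Toy, made-up data: eight frames, three distinct units.
units = [3, 3, 3, 7, 7, 12, 12, 12]
phonemes = ["AH", "AH", "AH", "S", "S", "T", "T", "S"]
speakers = ["spk1", "spk2", "spk1", "spk2", "spk1", "spk2", "spk1", "spk2"]

print(unit_purity(units, phonemes))  # 0.875
print(unit_purity(units, speakers))  # 0.625
```

On this toy data the units are purer with respect to phonemes than to speakers, which is the kind of contrast the abstract describes. Real measurements would need forced-aligned phoneme labels and a much larger corpus.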
Related papers
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
arXiv Detail & Related papers (2024-06-08T18:34:28Z)
- A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors [9.279391026742658]
We analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor.
We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations.
arXiv Detail & Related papers (2023-11-27T15:58:28Z)
- "You Are An Expert Linguistic Annotator": Limits of LLMs as Analyzers of Abstract Meaning Representation [60.863629647985526]
We examine the successes and limitations of the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning structure.
We find that models can reliably reproduce the basic format of AMR, and can often capture core event, argument, and modifier structure.
Overall, our findings indicate that these models out-of-the-box can capture aspects of semantic structure, but there remain key limitations in their ability to support fully accurate semantic analyses or parses.
arXiv Detail & Related papers (2023-10-26T21:47:59Z)
- SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges in learning sentence-level representation of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
arXiv Detail & Related papers (2023-10-16T20:05:36Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
- Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language.
We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z)
- Nonlinear ISA with Auxiliary Variables for Learning Speech Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)