A Brief Overview of Unsupervised Neural Speech Representation Learning
- URL: http://arxiv.org/abs/2203.01829v1
- Date: Tue, 1 Mar 2022 11:15:35 GMT
- Title: A Brief Overview of Unsupervised Neural Speech Representation Learning
- Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Joakim Edin, Lars Maaløe, Christian Igel
- Abstract summary: We review the development of unsupervised representation learning for speech over the last decade.
We identify two primary model categories: self-supervised methods and probabilistic latent variable models.
- Score: 12.850357461259197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised representation learning for speech processing has matured
greatly in the last few years. Work in computer vision and natural language
processing has paved the way, but speech data offers unique challenges. As a
result, methods from other domains rarely translate directly. We review the
development of unsupervised representation learning for speech over the last
decade. We identify two primary model categories: self-supervised methods and
probabilistic latent variable models. We describe the models and develop a
comprehensive taxonomy. Finally, we discuss and compare models from the two
categories.
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbations such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Learning to Diversify Neural Text Generation via Degenerative Model [39.961572541752005]
We propose a new approach to prevent degeneration problems by training two models.
We first train a model that is designed to amplify undesirable patterns.
We then enhance the diversity of the second model by focusing on patterns that the first model fails to learn.
arXiv Detail & Related papers (2023-09-22T04:57:10Z)
- Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features [35.31998003091635]
We introduce a new approach to explain speech classification models.
We generate easy-to-interpret explanations via input perturbation on two information levels.
We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian.
arXiv Detail & Related papers (2023-09-14T14:12:34Z)
- Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model [21.286529902957724]
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
arXiv Detail & Related papers (2023-05-19T05:19:04Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language [85.9019051663368]
data2vec is a framework that applies the same learning method to speech, NLP, and computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.
Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance.
arXiv Detail & Related papers (2022-02-07T22:52:11Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
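The sentence-unshuffling objective described in the entry above can be sketched as a data-construction step: shuffle a document's sentences and ask the model to recover the original order. This is a conceptual sketch of the pre-training pair construction only, not SLM's actual implementation; the document text and function name are made up for illustration.

```python
import random

def make_unshuffling_example(sentences, rng):
    # Shuffle the sentences and record, for each shuffled position k,
    # the original index of that sentence. The model's target is this
    # permutation, i.e. the order that restores the document.
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    return shuffled, order  # order[k] = original index of shuffled[k]

rng = random.Random(0)
doc = ["The sky darkened.", "Rain began to fall.", "Everyone ran inside."]
shuffled, target = make_unshuffling_example(doc, rng)

# Applying the target permutation restores the original document.
restored = [None] * len(doc)
for k, i in enumerate(target):
    restored[i] = shuffled[k]
```

Training a model to predict `target` from `shuffled` forces it to reason about discourse-level coherence rather than only token-level context.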
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.