Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
- URL: http://arxiv.org/abs/2204.02470v1
- Date: Tue, 5 Apr 2022 20:09:15 GMT
- Title: Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
- Authors: Dan Berrebbi, Jiatong Shi, Brian Yan, Osbel Lopez-Francisco, Jonathan D. Amith, Shinji Watanabe
- Abstract summary: Self-Supervised Learning (SSL) models have been successfully applied in various deep learning-based speech tasks.
The quality of SSL representations depends highly on the relatedness between the SSL training domain(s) and the target data domain.
We propose a learnable and interpretable framework to combine spectral feature (SF) and SSL representations.
- Score: 27.857955394020475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-Supervised Learning (SSL) models have been successfully applied in
various deep learning-based speech tasks, particularly those with a limited
amount of data. However, the quality of SSL representations depends highly on
the relatedness between the SSL training domain(s) and the target data domain.
In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are
hand-crafted, non-learnable components and could be more robust to domain
shifts. The present work examines the assumption that combining non-learnable
SF extractors with SSL models is an effective approach to low resource speech
tasks. We propose a learnable and interpretable framework to combine SF and SSL
representations. The proposed framework significantly outperforms both baseline
and SSL models on Automatic Speech Recognition (ASR) and Speech Translation
(ST) tasks on three low resource datasets. We additionally design a
mixture-of-experts-based combination model. This last model reveals that the
relative contribution of SSL models over conventional SF extractors is very
small in the case of a domain mismatch between the SSL training set and the
target language data.
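To make the combination concrete, below is a minimal PyTorch-style sketch of a learnable, interpretable fusion of an SF stream and an SSL stream, assuming the two streams are frame-synchronous; the class name FeatureFuser and all dimensions are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class FeatureFuser(nn.Module):
    """Learn interpretable convex weights over a spectral-feature (SF) stream
    and an SSL-feature stream (illustrative sketch, not the paper's exact model)."""

    def __init__(self, sf_dim: int, ssl_dim: int, out_dim: int):
        super().__init__()
        # Project both streams to a shared dimensionality.
        self.sf_proj = nn.Linear(sf_dim, out_dim)
        self.ssl_proj = nn.Linear(ssl_dim, out_dim)
        # One logit per stream; a softmax turns them into mixing weights that
        # sum to 1, which makes the relative contributions easy to read off.
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, sf_feats, ssl_feats):
        # sf_feats: (batch, time, sf_dim); ssl_feats: (batch, time, ssl_dim),
        # assumed frame-synchronous.
        w = torch.softmax(self.logits, dim=0)
        return w[0] * self.sf_proj(sf_feats) + w[1] * self.ssl_proj(ssl_feats)

# Example: 80-dim log Mel-filterbank frames fused with 768-dim SSL frames.
fuser = FeatureFuser(sf_dim=80, ssl_dim=768, out_dim=256)
sf, ssl = torch.randn(4, 200, 80), torch.randn(4, 200, 768)
print(fuser(sf, ssl).shape)                # torch.Size([4, 200, 256])
print(torch.softmax(fuser.logits, dim=0))  # learned SF vs. SSL weights
```

A mixture-of-experts variant in the spirit of the abstract would replace the global logits with an input-dependent gating network, letting the SF/SSL weights vary per frame or per utterance.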
Related papers
- SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS [18.701864254184308]
Self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS.
In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker.
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect [11.013934239276036]
Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks.
This paper contributes by comparing the effectiveness of SSL approaches in the context of the low-resource spoken Tunisian Arabic dialect.
arXiv Detail & Related papers (2024-07-05T14:21:36Z)
- Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining [34.64600580301882]
We establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL.
In classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections.
In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts hinders its generation performance.
arXiv Detail & Related papers (2024-07-01T03:35:59Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- Reverse Engineering Self-Supervised Learning [17.720366509919167]
Self-supervised learning (SSL) is a powerful tool in machine learning.
This paper presents an in-depth empirical analysis of SSL-trained representations.
arXiv Detail & Related papers (2023-05-24T23:15:28Z)
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification to adapt them to the English pronunciation of English-as-a-second-language learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
arXiv Detail & Related papers (2022-04-08T06:13:55Z)
- DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective NAS approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Different self-supervised learning (SSL) tasks reveal different features of the data.
This work aims to combine multiple SSL tasks (Multi-SSL) into representations that generalize well to all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works have also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
- On Data-Augmentation and Consistency-Based Semi-Supervised Learning [77.57285768500225]
Recently proposed consistency-based Semi-Supervised Learning (SSL) methods have advanced the state of the art in several SSL tasks.
Despite these advances, the understanding of these methods is still relatively limited.
arXiv Detail & Related papers (2021-01-18T10:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.