Analyzing the factors affecting usefulness of Self-Supervised
Pre-trained Representations for Speech Recognition
- URL: http://arxiv.org/abs/2203.16973v4
- Date: Thu, 18 May 2023 01:24:47 GMT
- Title: Analyzing the factors affecting usefulness of Self-Supervised
Pre-trained Representations for Speech Recognition
- Authors: Ashish Seth and Lodagala V S V Durga Prasad and Sreyan Ghosh and S.
Umesh
- Abstract summary: Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition systems.
We study the effect of domain, language, dataset size, and other aspects of our upstream pre-training SSL data on the final performance of a low-resource downstream ASR task.
- Score: 1.0705399532413615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) to learn high-level speech representations has
been a popular approach to building Automatic Speech Recognition (ASR) systems
in low-resource settings. However, the common assumption made in the literature is
that a considerable amount of unlabeled data is available for the same domain
or language that can be leveraged for SSL pre-training, which we acknowledge is
not feasible in a real-world setting. In this paper, as part of the Interspeech
Gram Vaani ASR challenge, we try to study the effect of domain, language,
dataset size, and other aspects of our upstream pre-training SSL data on the
final performance of the low-resource downstream ASR task. We also build on the
continued pre-training paradigm to study the effect of prior knowledge
possessed by models trained using SSL. Extensive experiments and studies reveal
that the performance of ASR systems is sensitive to the data used for SSL
pre-training. Their performance improves with an increase in similarity and
volume of pre-training data. We believe our work will help the speech
community build better ASR systems in low-resource settings and steer
research towards improving generalization in SSL-based pre-training for speech
systems.
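As a concrete illustration of the continued pre-training paradigm described above, the sketch below resumes wav2vec 2.0-style self-supervised pre-training from a public checkpoint on unlabeled target-domain audio. This is a minimal sketch assuming the HuggingFace transformers pre-training utilities; the checkpoint name and the random waveform are placeholders, not the authors' actual setup.

```python
# Minimal sketch of one continued pre-training step for a wav2vec 2.0 model on
# unlabeled target-domain audio, following the HuggingFace Wav2Vec2ForPreTraining
# usage. The random waveform is a placeholder for a real in-domain utterance.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

waveform = torch.randn(16000 * 5)  # placeholder: 5 s of 16 kHz "in-domain" audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
input_values = inputs.input_values

# Sample masked time steps and contrastive negatives, as in wav2vec 2.0 pre-training.
batch_size, raw_len = input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.2, mask_length=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

outputs = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
)
outputs.loss.backward()  # contrastive + diversity loss on the target-domain batch
optimizer.step()
optimizer.zero_grad()
```

In practice this step would be iterated over an in-domain unlabeled corpus before fine-tuning the encoder on the small labeled ASR set.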
Related papers
- Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages [24.856817602140193]
This study focuses on two endangered Austronesian languages, Amis and Seediq.
We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data.
arXiv Detail & Related papers (2024-09-13T14:35:47Z)
- Self-supervised visual learning in the low-data regime: a comparative evaluation [40.27083924454058]
Self-Supervised Learning (SSL) is a robust training methodology for contemporary Deep Neural Networks (DNNs).
This work introduces a taxonomy of modern visual SSL methods, accompanied by detailed explanations and insights regarding the main categories of approaches.
For domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining.
arXiv Detail & Related papers (2024-04-26T07:23:14Z)
- An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution [5.1660803395535835]
Self-supervised learning (SSL) has shown stellar performance compared to traditional methods.
However, SSL-based automated speaking assessment (ASA) systems face at least three data-related challenges:
limited annotated data, an uneven distribution of learner proficiency levels, and non-uniform score intervals between different CEFR proficiency levels.
arXiv Detail & Related papers (2024-04-11T09:06:49Z)
- Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition [54.9235160379917]
Stable Distillation is a simple and novel approach for SSL-based continued pre-training.
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
arXiv Detail & Related papers (2023-12-20T06:02:12Z)
- Understanding and Improving the Role of Projection Head in Self-Supervised Learning [77.59320917894043]
Self-supervised learning (SSL) aims to produce useful feature representations without access to human-labeled data annotations.
Current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective.
This raises a fundamental question: Why is a learnable projection head required if we are to discard it after training?
arXiv Detail & Related papers (2022-12-22T05:42:54Z)
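As a rough sketch of the contrastive setup that paper examines (an illustrative toy example, not its proposed method), the snippet below appends a small learnable projection head to a stand-in backbone, optimizes the InfoNCE objective on the projected outputs, and keeps only the backbone representation for downstream use.

```python
# Toy sketch of contrastive pre-training with a discardable projection head.
# `Encoder` is a stand-in backbone; in practice it would be a speech or vision network.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, in_dim=80, rep_dim=256, proj_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, rep_dim)
        )
        # Projection head: used only for the contrastive objective, discarded afterwards.
        self.projection = nn.Sequential(
            nn.Linear(rep_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, proj_dim)
        )

    def forward(self, x):
        h = self.backbone(x)    # representation kept for downstream tasks
        z = self.projection(h)  # representation fed to the InfoNCE loss
        return h, z


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over two batches of projected views; matching rows are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))   # the i-th view pairs with the i-th view
    return F.cross_entropy(logits, targets)


model = Encoder()
x1, x2 = torch.randn(32, 80), torch.randn(32, 80)  # two augmented "views" of 32 inputs
_, z1 = model(x1)
_, z2 = model(x2)
loss = info_nce(z1, z2)
loss.backward()
# After pre-training: keep model.backbone, drop model.projection.
```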
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in computer vision and natural language processing has prompted its recent adoption in audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
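The idea can be sketched as follows. This is an illustrative toy re-implementation of intermediate-layer supervision, not the released ILS-SSL code; the layer index, pseudo-label vocabulary size, and shared prediction head are arbitrary choices.

```python
# Toy sketch of intermediate-layer supervision: the same SSL prediction loss is
# applied to an intermediate layer's output in addition to the final layer's output.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_layers, dim, vocab = 6, 256, 500        # 500 pseudo-labels (e.g. k-means units)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    for _ in range(num_layers)
)
head = nn.Linear(dim, vocab)                # shared prediction head over pseudo-labels

x = torch.randn(8, 100, dim)                # batch of (masked) frame features
targets = torch.randint(0, vocab, (8, 100)) # pseudo-label targets for the frames

hidden, intermediates = x, []
for layer in layers:
    hidden = layer(hidden)
    intermediates.append(hidden)

# Usual SSL loss on the final layer ...
loss = F.cross_entropy(head(intermediates[-1]).transpose(1, 2), targets)
# ... plus an additional SSL loss on an intermediate layer (here layer 3),
# which pushes the lower layers toward content information.
loss = loss + F.cross_entropy(head(intermediates[2]).transpose(1, 2), targets)
loss.backward()
```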
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale the training dataset up to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using large amounts of unlabeled data has been successfully explored for image and natural language processing.
Recent works have also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
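On the downstream side discussed throughout this list (and in the main paper's low-resource ASR task), a common recipe is to fine-tune the pre-trained SSL encoder with a CTC head on a small labeled set. The sketch below assumes the HuggingFace Wav2Vec2ForCTC API; the checkpoint, vocabulary size, and batch are placeholders rather than any paper's actual configuration.

```python
# Minimal sketch: fine-tune a pre-trained SSL encoder with a CTC head on a small
# labeled set (the typical low-resource downstream ASR recipe). The checkpoint,
# vocabulary size, and data below are placeholders, not the paper's setup.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",   # would be the continued-pre-trained checkpoint
    vocab_size=32,              # size of the downstream character vocabulary
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice: keep the CNN feature encoder frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Placeholder batch: 2 utterances of 16 kHz audio and integer-encoded transcripts.
input_values = torch.randn(2, 16000 * 4)
labels = torch.randint(1, 32, (2, 40))

outputs = model(input_values=input_values, labels=labels)
outputs.loss.backward()         # CTC loss on the small labeled set
optimizer.step()
optimizer.zero_grad()
```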