Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models
- URL: http://arxiv.org/abs/2008.03923v1
- Date: Mon, 10 Aug 2020 07:00:08 GMT
- Title: Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models
- Authors: Prakhar Swarup, Debmalya Chakrabarty, Ashtosh Sapru, Hitesh Tulsiani, Harish Arsikere, Sri Garimella
- Abstract summary: Semi-supervised learning (SSL) is an active area of research that aims to utilize unlabelled data to improve the accuracy of speech recognition systems.
Our aim is to establish the importance of good criteria in selecting samples from a large pool of unlabelled data.
We perform empirical investigations of different data selection methods to answer this question and quantify the effect of different sampling strategies.
- Score: 9.496916045581736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised learning (SSL) is an active area of research that aims
to utilize unlabelled data to improve the accuracy of speech recognition systems.
The current study proposes a methodology for integrating two key ideas: 1) SSL
using the connectionist temporal classification (CTC) objective and
teacher-student learning, and 2) effective data-selection mechanisms for
leveraging unlabelled data to boost the performance of student models. Our aim
is to establish the importance of good criteria for selecting samples from a
large pool of unlabelled data based on attributes such as confidence measure,
speaker variability, and content variability. The question we try to answer is:
is it possible to design a data selection mechanism that reduces dependence on a
large set of randomly selected unlabelled samples without compromising Word
Error Rate (WER)? We perform empirical investigations of different data
selection methods to answer this question and quantify the effect of different
sampling strategies. In a semi-supervised ASR setting with 40,000 hours of
carefully selected unlabelled data, our CTC-SSL approach gives a 17% relative
WER improvement over a baseline CTC system trained on labelled data alone. It
also achieves performance on par with a CTC-SSL system trained on an order of
magnitude more unlabelled data selected by random sampling.
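The abstract describes a teacher-student CTC pipeline in which a confidence measure gates which unlabelled utterances feed distillation. The paper ships no code, so the following is only a minimal sketch of that selection-plus-pseudo-labelling step; the teacher interface, feature format, and mean-max-posterior confidence measure are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of confidence-based selection for CTC teacher-student SSL.
# The model interface and the confidence measure are assumptions, not the
# authors' actual implementation.
import torch

@torch.no_grad()
def select_and_pseudo_label(teacher, utterances, conf_threshold=0.9):
    """Run a trained CTC teacher over unlabelled utterances and keep the
    confident ones, attaching soft targets for knowledge distillation."""
    selected = []
    for utt in utterances:
        log_probs = teacher(utt["features"])    # (T, vocab) CTC log-posteriors
        probs = log_probs.exp()
        frame_conf = probs.max(dim=-1).values   # winning posterior per frame
        if frame_conf.mean().item() >= conf_threshold:
            # Frame-level posteriors serve as soft targets for the student;
            # greedy CTC decoding would yield hard pseudo-labels instead.
            selected.append({"features": utt["features"], "soft_targets": probs})
    return selected
```

A student CTC model would then be trained on the selected utterances, e.g. with a frame-level KL-divergence loss against the soft targets; the paper's point is that good selection criteria (confidence plus speaker and content variability) shrink the unlabelled pool without hurting WER.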
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement [8.509688686402438]
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities.
This work addresses the question: How can we determine the optimal subset of data for effective training?
Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset.
arXiv Detail & Related papers (2024-09-17T17:25:31Z)
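As a concrete reading of the k-means selection step in the "Diversify and Conquer" entry above, here is a toy sketch that keeps the sample nearest each centroid. The embedding source and that paper's iterative refinement loop are omitted, so treat every name here as a hypothetical stand-in.

```python
# Toy k-means subset selection: keep the sample nearest each cluster centroid.
# Embeddings are assumed precomputed; the cited paper additionally refines
# the selection iteratively, which this sketch omits.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_subset(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])  # medoid-like representative
    return np.asarray(chosen)
```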
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically.
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models [13.956691231452336]
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models.
Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget.
arXiv Detail & Related papers (2022-12-03T18:05:08Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization [88.74813798138466]
Localizing keypoints of an object is a basic visual problem.
Supervised learning of a keypoint localization network often requires a large amount of data.
We propose to automatically select reliable pseudo-labeled samples with a series of dynamic thresholds.
arXiv Detail & Related papers (2022-01-21T09:51:58Z)
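The "Pseudo-Labeled Auto-Curriculum Learning" entry above selects reliable pseudo-labelled samples with a series of dynamic thresholds. The sketch below shows one plausible reading, where a confidence threshold relaxes over training rounds; the schedule and its parameters are assumptions, not that paper's actual curriculum.

```python
# Hypothetical dynamic-threshold filter: start strict, then admit harder
# samples as training progresses. The schedule is an assumption.
def select_pseudo_labels(confidences, round_idx, start=0.95, decay=0.05, floor=0.6):
    threshold = max(start - decay * round_idx, floor)
    return [i for i, c in enumerate(confidences) if c >= threshold]

# Example: round 0 keeps only samples scoring >= 0.95; round 4 keeps >= 0.75.
print(select_pseudo_labels([0.99, 0.80, 0.70], round_idx=4))  # -> [0, 1]
```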
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
- Multiple-criteria Based Active Learning with Fixed-size Determinantal Point Processes [43.71112693633952]
We introduce a multiple-criteria based active learning algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity.
We show that our method performs significantly better and is more stable than other multiple-criteria based AL algorithms.
arXiv Detail & Related papers (2021-07-04T13:22:54Z)
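The "Multiple-criteria Based Active Learning" entry above combines informativeness, representativeness, and diversity via a fixed-size determinantal point process, which is beyond a short sketch. The greedy scorer below only illustrates how the three criteria might be combined; the score definitions and weights are assumptions.

```python
# Greedy stand-in for multi-criteria sample selection. The cited paper samples
# from a fixed-size determinantal point process; this sketch merely combines
# entropy (informativeness), mean similarity (representativeness), and
# distance to already-chosen items (diversity).
import numpy as np

def greedy_multi_criteria(probs, embeddings, budget, w=(1.0, 1.0, 1.0)):
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    sims = embeddings @ embeddings.T        # assumes L2-normalized embeddings
    representativeness = sims.mean(axis=1)
    chosen = []
    for _ in range(budget):
        if chosen:
            diversity = 1.0 - sims[:, chosen].max(axis=1)
        else:
            diversity = np.ones(len(probs))
        score = w[0] * entropy + w[1] * representativeness + w[2] * diversity
        score[chosen] = -np.inf             # never re-pick a selected sample
        chosen.append(int(score.argmax()))
    return chosen
```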
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Industry Scale Semi-Supervised Learning for Natural Language Understanding [14.844450283047234]
This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework.
We investigate two questions related to the use of unlabeled data in production SSL context.
We compare four widely used SSL techniques: Pseudo-Label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT).
arXiv Detail & Related papers (2021-03-29T18:24:02Z)
- Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning [54.85397562961903]
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available.
We address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in unlabeled data.
Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
arXiv Detail & Related papers (2020-07-22T10:33:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.