Efficient data selection employing Semantic Similarity-based Graph
Structures for model training
- URL: http://arxiv.org/abs/2402.14888v1
- Date: Thu, 22 Feb 2024 09:43:53 GMT
- Title: Efficient data selection employing Semantic Similarity-based Graph
Structures for model training
- Authors: Roxana Petcu and Subhadeep Maji
- Abstract summary: This paper introduces Semantics for data SAliency in Model performance Estimation (SeSaME)
It is an efficient data sampling mechanism solely based on textual information without passing the data through a compute-heavy model.
The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models.
- Score: 1.5845679507219355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in natural language processing (NLP) have highlighted the
need for substantial amounts of data for models to capture textual information
accurately. This raises concerns regarding the computational resources and time
required for training such models. This paper introduces Semantics for data
SAliency in Model performance Estimation (SeSaME). It is an efficient data
sampling mechanism solely based on textual information without passing the data
through a compute-heavy model or other intensive pre-processing
transformations. The application of this approach is demonstrated in the use
case of low-resource automated speech recognition (ASR) models, which
excessively rely on text-to-speech (TTS) calls when using augmented data.
SeSaME learns to categorize new incoming data points into speech recognition
difficulty buckets by employing semantic similarity-based graph structures and
discrete ASR information from homophilous neighbourhoods through message
passing. The results indicate reliable projections of ASR performance, with a
93% accuracy increase when using the proposed method compared to random
predictions, providing non-trivial insight into the impact of textual
representations in speech models. Furthermore, a series of experiments shows
both the benefits and challenges of using the ASR information on incoming data
to fine-tune the model. We report a 7% drop in validation loss compared to
random sampling, a 7% WER drop with non-local aggregation when evaluating against
a highly difficult dataset, and a 1.8% WER drop with local aggregation and high
semantic similarity between datasets.
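The abstract describes the mechanism at a high level: embed utterance texts, connect semantically similar utterances in a graph, and propagate discrete ASR-difficulty information from already-evaluated (homophilous) neighbours to new data points via message passing. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: TF-IDF cosine similarity stands in for the paper's semantic embeddings, and a single round of similarity-weighted neighbour voting stands in for learned message passing. All names and example data are hypothetical.

```python
# Minimal sketch of the idea described in the abstract (not the authors' code):
# build a semantic-similarity graph over utterance texts and infer ASR-difficulty
# buckets for new utterances from the labelled buckets of their nearest neighbours.
#
# Assumptions (not from the paper): TF-IDF cosine similarity replaces the
# semantic text embeddings, and one round of k-NN similarity-weighted voting
# replaces learned message passing over the graph.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Utterances whose ASR difficulty is already known (e.g. from past WER),
# bucketed into discrete difficulty levels: 0 = easy, 1 = medium, 2 = hard.
labelled_texts = [
    "turn on the kitchen lights",
    "play some relaxing music",
    "what is the weather tomorrow",
    "schedule a meeting with the radiology department",
    "transcribe the quarterly earnings call",
]
labelled_buckets = np.array([0, 0, 0, 2, 2])

# New, unlabelled utterances to triage without running TTS + ASR on them.
new_texts = [
    "dim the living room lights",
    "summarise the earnings call transcript",
]

# 1. Embed all texts in a shared space (TF-IDF here as a simplification).
vectorizer = TfidfVectorizer().fit(labelled_texts + new_texts)
labelled_vecs = vectorizer.transform(labelled_texts)
new_vecs = vectorizer.transform(new_texts)

# 2. Build similarity "edges" from each new utterance to the labelled ones.
similarity = cosine_similarity(new_vecs, labelled_vecs)  # shape: (new, labelled)

# 3. One message-passing step: each new node takes a similarity-weighted vote
#    over the difficulty buckets of its k most similar labelled neighbours.
def predict_buckets(similarity, buckets, k=3, n_buckets=3):
    preds = []
    for sims in similarity:
        neighbours = np.argsort(sims)[-k:]      # top-k homophilous neighbours
        votes = np.zeros(n_buckets)
        for j in neighbours:
            votes[buckets[j]] += sims[j]        # weight each vote by similarity
        preds.append(int(np.argmax(votes)))
    return preds

print(predict_buckets(similarity, labelled_buckets))
# Predicted buckets can then drive data selection, e.g. prioritising or skipping
# utterances before spending compute on TTS synthesis and ASR fine-tuning.
```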
Related papers
- Dynamic Data Pruning for Automatic Speech Recognition [58.95758272440217]
We introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers fine-grained pruning granularities specifically tailored for speech-related datasets.
Our experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
arXiv Detail & Related papers (2024-06-26T14:17:36Z) - Retrieval-based Disentangled Representation Learning with Natural
Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thereby achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z) - A Complementary Joint Training Approach Using Unpaired Speech and Text
for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of the speech-PseudoLabel pair and the SynthesizedAudio-text pair, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z) - Listen, Adapt, Better WER: Source-free Single-utterance Test-time
Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z) - Representative Subset Selection for Efficient Fine-Tuning in
Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z) - Towards Synthetic Multivariate Time Series Generation for Flare
Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the resulting enlarged dataset efficient, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.