LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker
Recognition to Overcome Data Scarcity
- URL: http://arxiv.org/abs/2007.00659v2
- Date: Fri, 3 Jul 2020 17:02:31 GMT
- Title: LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker
Recognition to Overcome Data Scarcity
- Authors: Jordan J. Bird, Diego R. Faria, Anikó Ekárt, Cristiano Premebida,
Pedro P. S. Ayrosa
- Abstract summary: In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification.
In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes.
Using character-level LSTMs and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In speech recognition problems, data scarcity often poses an issue due to the
unwillingness of humans to provide large amounts of data for learning and
classification. In this work, we take a set of 5 spoken Harvard sentences from
7 subjects and consider their MFCC attributes. Using character-level LSTMs
(supervised learning) and OpenAI's attention-based GPT-2 models, synthetic
MFCCs are generated by learning from the data provided on a per-subject basis.
A neural network is trained to classify the data against a large dataset of
Flickr8k speakers and is then compared to a transfer learning network
performing the same task but with an initial weight distribution dictated by
learning from the synthetic data generated by the two models. For all 7
subjects, the best results came from networks that had been exposed to
synthetic data: the model pre-trained with LSTM-produced data achieved the
best result 3 times and the GPT-2 equivalent 5 times (one subject's best
result was a tie between the two models). Through these results, we argue
that speaker
classification can be improved by utilising a small amount of user data but
with exposure to synthetically-generated MFCCs which then allow the networks to
achieve near maximum classification scores.
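The pipeline described in the abstract, pre-training a classifier on synthetic data and then fine-tuning it on the scarce genuine data, can be illustrated with a minimal sketch. The toy logistic-regression "network" and the Gaussian stand-ins for MFCC vectors below are assumptions for illustration only; in the paper the synthetic MFCCs come from per-subject character-level LSTM and GPT-2 models, and the classifier is a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 26  # MFCC-like feature dimensionality; the exact size is an assumption

# A fixed direction separating the two classes (speaker vs. non-speaker).
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

def make_data(n, scale=1.2):
    """Toy stand-in for MFCC feature vectors from a two-class problem."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, dim)) + np.outer(2 * y - 1, direction) * scale
    return X, y

X_syn, y_syn = make_data(500)    # plentiful synthetic "MFCCs"
X_real, y_real = make_data(20)   # scarce genuine data
X_test, y_test = make_data(400)  # held-out evaluation set

def train_logreg(X, y, w=None, b=0.0, lr=0.1, epochs=300):
    """Gradient-descent logistic regression; w and b allow warm-starting."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0) == y))

# Baseline: train only on the scarce real data.
w_base, b_base = train_logreg(X_real, y_real)

# Transfer learning: pre-train on synthetic data, then fine-tune on real
# data, so the initial weights are dictated by the synthetic data.
w_pre, b_pre = train_logreg(X_syn, y_syn)
w_tl, b_tl = train_logreg(X_real, y_real, w=w_pre.copy(), b=b_pre)

print("scratch: ", accuracy(X_test, y_test, w_base, b_base))
print("transfer:", accuracy(X_test, y_test, w_tl, b_tl))
```

The essential point matches the paper's comparison: both models see the same small amount of genuine data, and only the initial weight distribution differs.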
Related papers
- Diffusion-based Neural Network Weights Generation [85.6725307453325]
We propose an efficient and adaptive transfer learning scheme through dataset-conditioned pretrained weights sampling.
Specifically, we use a latent diffusion model with a variational autoencoder that can reconstruct the neural network weights.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- TarGEN: Targeted Data Generation with Large Language Models [54.1093098278564]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks [0.0]
We compare the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks.
Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts.
arXiv Detail & Related papers (2023-04-26T23:09:02Z)
- Convolutional Neural Networks for the classification of glitches in gravitational-wave data streams [52.77024349608834]
We classify transient noise signals (i.e. glitches) and gravitational waves in data from the Advanced LIGO detectors.
We use models with a supervised learning approach, trained from scratch on the Gravity Spy dataset.
We also explore a self-supervised approach, pre-training models with automatically generated pseudo-labels.
arXiv Detail & Related papers (2023-03-24T11:12:37Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- DDKtor: Automatic Diadochokinetic Speech Analysis [13.68342426889044]
This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech.
Results on a young healthy individuals dataset show that our LSTM model outperforms the current state-of-the-art systems.
The LSTM model also presents results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
arXiv Detail & Related papers (2022-06-29T13:34:03Z)
- Using GPT-2 to Create Synthetic Data to Improve the Prediction Performance of NLP Machine Learning Classification Models [0.0]
It is becoming common practice to utilize synthetic data to boost the performance of Machine Learning Models.
I used a Yelp pizza restaurant reviews dataset and transfer learning to fine-tune a pre-trained GPT-2 Transformer Model to generate synthetic pizza reviews data.
I then combined this synthetic data with the original genuine data to create a new joint dataset.
arXiv Detail & Related papers (2021-04-02T20:20:42Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets, Adult and Communities and Crime, as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective yields improved micro-F1 scores, with absolute improvements of up to 8% compared to models trained with the cross-entropy loss function.
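One common way to realise a differentiable F-measure approximation, as this summary describes, is to replace the hard true-positive, false-positive, and false-negative counts with expected counts computed from predicted probabilities. The soft-F1 loss below is a minimal sketch of that idea; it is a generic surrogate, not necessarily the paper's exact approximation.

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    """Differentiable F1 surrogate: hard TP/FP/FN counts are replaced by
    expected counts from predicted probabilities, so the objective can be
    minimised with standard backpropagation."""
    tp = np.sum(probs * labels)        # expected true positives
    fp = np.sum(probs * (1 - labels))  # expected false positives
    fn = np.sum((1 - probs) * labels)  # expected false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1  # minimising the loss maximises soft F1

labels = np.array([1, 0, 1, 1, 0], dtype=float)
print(soft_f1_loss(np.array([0.9, 0.1, 0.8, 0.95, 0.05]), labels))
print(soft_f1_loss(np.array([0.5] * 5), labels))  # uninformative predictions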
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
- Data augmentation using generative networks to identify dementia [20.137419355252362]
We show that generative models can be used as an effective approach for data augmentation.
In this paper, we investigate the application of a similar approach to different types of speech and audio-based features extracted from our automatic dementia detection system.
arXiv Detail & Related papers (2020-04-13T15:05:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.