Deepfake audio as a data augmentation technique for training automatic
speech to text transcription models
- URL: http://arxiv.org/abs/2309.12802v1
- Date: Fri, 22 Sep 2023 11:33:03 GMT
- Authors: Alexandre R. Ferreira, Cláudio E. C. Campelo
- Abstract summary: We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indian speakers (in English) was selected, ensuring the presence of a single accent.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To train transcription models that produce robust results, a large and diverse
labeled dataset is required. Finding data with the necessary
characteristics is a challenging task, especially for languages less popular
than English. Moreover, producing such data requires significant effort and
often financial cost. Therefore, a strategy to mitigate this problem is the use of data
augmentation techniques. In this work, we propose a framework for
data augmentation based on deepfake audio. To validate the framework,
experiments were conducted using existing deepfake and transcription models. A
voice cloner and a dataset produced by Indian speakers (in English) were selected,
ensuring the presence of a single accent in the dataset. Subsequently, the
augmented data was used to train speech-to-text models in various scenarios.
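The augmentation idea described in the abstract can be sketched as follows. This is a minimal, runnable illustration, not the authors' implementation: `clone_voice` is a hypothetical stand-in for a real neural voice cloner (here it only perturbs the waveform deterministically so the sketch executes), and the dataset is a toy list of (audio, transcript) pairs.

```python
import random

def clone_voice(audio, speaker_seed):
    # Hypothetical stand-in for a neural voice cloner: a real system would
    # resynthesize the utterance in a new voice; here we only add a small,
    # seed-dependent perturbation so the example is self-contained.
    rng = random.Random(speaker_seed)
    return [sample + rng.uniform(-0.01, 0.01) for sample in audio]

def augment_with_deepfakes(dataset, n_voices=3):
    # Expand a labeled (audio, transcript) dataset with synthetic voices.
    # The transcript is reused verbatim: the cloned audio carries the same
    # words, only the (simulated) speaker identity changes.
    augmented = list(dataset)
    for audio, text in dataset:
        for seed in range(n_voices):
            augmented.append((clone_voice(audio, seed), text))
    return augmented

real = [([0.0, 0.1, -0.2], "hello world"), ([0.3, 0.0], "speech to text")]
bigger = augment_with_deepfakes(real, n_voices=3)
print(len(bigger))  # 2 real + 2 * 3 synthetic = 8
```

The augmented set keeps every original pair, so a speech-to-text model trained on it still sees the real data alongside the synthetic voices.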
Related papers
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
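The pre-training objective summarized above (reconstruct de-warped mel-spectrograms from warped ones) can be sketched as a data-pair generator. This is a simplified illustration under assumed details: `time_warp` randomly drops or duplicates spectrogram frames, which is only one plausible form of warping, not necessarily the one used in the paper.

```python
import random

def time_warp(mel, rng):
    # Warp the time axis by randomly dropping or duplicating frames,
    # a simplified stand-in for the warping used during pre-training.
    warped = []
    for frame in mel:
        r = rng.random()
        if r < 0.2:
            continue              # drop the frame (locally speed up)
        warped.append(frame)
        if r > 0.8:
            warped.append(frame)  # duplicate the frame (locally slow down)
    return warped or [mel[0]]     # guard against an empty result

# A pre-training example pairs the warped input with the clean target,
# so the model learns to undo the warp without any transcripts.
rng = random.Random(0)
mel = [[float(t + f) for f in range(3)] for t in range(5)]  # 5 frames x 3 bins
pair = (time_warp(mel, rng), mel)
```

Because both halves of each pair come from untranscribed audio alone, this objective needs no labels, which is what makes it useful in low-resource settings.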
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
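The rephrase-and-relabel pattern behind AugGPT can be sketched as below. This is an illustrative approximation: AugGPT prompts ChatGPT for the paraphrases, whereas `paraphrase` here is a hypothetical fixed-template stub so the example runs offline; the label-sharing logic is the part being demonstrated.

```python
def paraphrase(sentence, n=2):
    # Hypothetical stand-in for a ChatGPT call: AugGPT would prompt the
    # model for n rephrasings; fixed templates keep this sketch runnable.
    templates = ["In other words, {}", "Put differently, {}"]
    return [t.format(sentence) for t in templates[:n]]

def auggpt_augment(samples):
    # samples: list of (sentence, label). Each sentence is expanded into
    # itself plus its paraphrases, all inheriting the original label.
    out = []
    for sentence, label in samples:
        out.append((sentence, label))
        out.extend((p, label) for p in paraphrase(sentence))
    return out

train = [("the battery drains fast", "negative")]
augmented = auggpt_augment(train)
print(len(augmented))  # 1 original + 2 paraphrases = 3
```

In the few-shot setting this matters because each labeled example is expensive; multiplying each one into label-preserving variants gives the classifier more surface forms per label.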
arXiv Detail & Related papers (2023-01-20T10:33:03Z)
- Language Agnostic Data-Driven Inverse Text Normalization [6.43601166279978]
The inverse text normalization (ITN) problem has attracted the attention of researchers from various fields.
Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited.
We propose a language-agnostic data-driven ITN framework to fill this gap.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is equally important to using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2021-06-11T10:59:09Z)
- HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
The pipeline produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.