Improving End-to-End Speech Processing by Efficient Text Data
Utilization with Latent Synthesis
- URL: http://arxiv.org/abs/2310.05374v3
- Date: Tue, 24 Oct 2023 06:48:55 GMT
- Title: Improving End-to-End Speech Processing by Efficient Text Data
Utilization with Latent Synthesis
- Authors: Jianqiao Lu, Wenyong Huang, Nianzu Zheng, Xingshan Zeng, Yu Ting
Yeung, Xiao Chen
- Abstract summary: Training a high performance end-to-end speech (E2E) processing model requires an enormous amount of labeled speech data.
We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models.
- Score: 17.604583337593677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a high performance end-to-end speech (E2E) processing model requires
an enormous amount of labeled speech data, especially in the era of
data-centric artificial intelligence. However, labeled speech data are usually
scarcer and more expensive for collection, compared to textual data. We propose
Latent Synthesis (LaSyn), an efficient textual data utilization framework for
E2E speech processing models. We train a latent synthesizer to convert textual
data into an intermediate latent representation of a pre-trained speech model.
These pseudo acoustic representations of textual data augment acoustic data for
model training. We evaluate LaSyn on low-resource automatic speech recognition
(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an
E2E baseline trained on LibriSpeech train-clean-100, with relative word error
rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our
E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for
slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM)
and EM-Tree accuracies on STOP respectively. With fewer parameters, the results
of LaSyn are competitive to published state-of-the-art works. The results
demonstrate the quality of the augmented training data.
Related papers
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z) - Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition [18.924716098922683]
Machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions.
We propose two novel techniques during training to mitigate the problems due to the distribution gap.
We show that these methods significantly improve the training of speech recognition models using synthetic data.
arXiv Detail & Related papers (2021-10-21T21:11:42Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - MixSpeech: Data Augmentation for Low-resource Automatic Speech
Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR)
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z) - Exploring Transfer Learning For End-to-End Spoken Language Understanding [8.317084844841323]
An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
arXiv Detail & Related papers (2020-12-15T19:02:15Z) - End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [20.456325305495966]
This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task.
The task evaluates systems' ability to translate English TED talks audio into German texts.
Our system is an end-to-end model based on an adaptation of the Transformer for speech data.
arXiv Detail & Related papers (2020-06-04T15:47:47Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.