STTATTS: Unified Speech-To-Text And Text-To-Speech Model
- URL: http://arxiv.org/abs/2410.18607v1
- Date: Thu, 24 Oct 2024 10:04:24 GMT
- Title: STTATTS: Unified Speech-To-Text And Text-To-Speech Model
- Authors: Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki,
- Abstract summary: We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters.
Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models.
- Score: 6.327929516375736
- License:
- Abstract: Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs ($\sim$50\% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z) - Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z) - Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates ($$7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z) - Improving Massively Multilingual ASR With Auxiliary CTC Objectives [40.10307386370194]
We introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark.
We investigate techniques inspired from recent Connectionist Temporal Classification ( CTC) studies to help the model handle the large number of languages.
Our state-of-the-art systems using self-supervised models with the Conformer architecture improve over the results of prior work on FLEURS by a relative 28.4% CER.
arXiv Detail & Related papers (2023-02-24T18:59:51Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Multitask Learning for Low Resource Spoken Language Understanding [26.106133114838215]
We train models on dual objectives with automatic speech recognition and intent classification or sentiment classification.
Our models, although being of modest size, show improvements over models trained end-to-end on intent classification.
We study the performance of the models in low-resource scenario by training the models with as few as one example per class.
arXiv Detail & Related papers (2022-11-24T16:38:17Z) - M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for
Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z) - Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken
Conversations [22.894541507068933]
This paper presents our approach to build generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10.
We employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation.
Our approach ranks third on the objective evaluation and second on the final official human evaluation.
arXiv Detail & Related papers (2022-03-08T12:26:57Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.