SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
- URL: http://arxiv.org/abs/2401.18045v1
- Date: Wed, 31 Jan 2024 18:06:29 GMT
- Title: SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
- Authors: Yihan Wu, Soumi Maiti, Yifan Peng, Wangyou Zhang, Chenda Li, Yuyue
Wang, Xihua Wang, Shinji Watanabe, Ruihua Song
- Abstract summary: Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
- Score: 67.08798754009153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in language models have significantly enhanced
performance in multiple speech-related tasks. Existing speech language models
typically utilize task-dependent prompt tokens to unify various speech tasks in
a single model. However, this design overlooks the intrinsic connections between
different speech tasks; modeling these connections could boost the performance of
each task. In this work, we propose a novel decoder-only speech language model,
SpeechComposer, that can unify common speech tasks by composing a fixed set of
prompt tokens. Built upon four primary tasks -- speech synthesis, speech
recognition, speech language modeling, and text language modeling --
SpeechComposer can easily extend to more speech tasks, such as voice conversion
and speech enhancement, via compositions of well-designed prompt tokens. The
unification of prompt tokens also enables knowledge sharing among different
speech tasks in a more structured manner. Experimental results
demonstrate that our proposed SpeechComposer can improve the performance of
both primary tasks and composite tasks, showing the effectiveness of the shared
prompt tokens. Remarkably, the unified decoder-only model achieves performance
comparable to, and even better than, expert baseline models designed for single
tasks.
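To make the composition mechanism concrete, here is a minimal Python sketch of how a fixed set of task prompt tokens could be chained into decoder sequences for both primary and composite tasks. The token names, discrete speech-unit ids, and the exact sequence layout are illustrative assumptions, not the actual vocabulary or scheme of SpeechComposer.

    # Fixed set of prompt tokens, one per primary task (names are hypothetical).
    ASR = "<asr>"  # speech recognition: speech units -> text tokens
    TTS = "<tts>"  # speech synthesis: text tokens -> speech units
    SLM = "<slm>"  # speech language modeling: continue a speech-unit stream
    TLM = "<tlm>"  # text language modeling: continue a text stream

    def asr_sequence(speech, text):
        # Primary task: speech input + <asr> prompt + transcription.
        return speech + [ASR] + text

    def voice_conversion_sequence(speech, text, target_speech):
        # Composite task sketched as a composition of the ASR and TTS prompts:
        # recognize the content, then re-synthesize it as target speech.
        return speech + [ASR] + text + [TTS] + target_speech

    src = ["<s_101>", "<s_57>"]   # hypothetical discrete speech units
    txt = ["hel", "lo"]           # text subword tokens
    tgt = ["<s_42>", "<s_7>"]     # hypothetical target-speaker units

    print(voice_conversion_sequence(src, txt, tgt))
    # ['<s_101>', '<s_57>', '<asr>', 'hel', 'lo', '<tts>', '<s_42>', '<s_7>']

At inference time, only the input and the prompt tokens would be supplied, with the decoder autoregressively filling in the intermediate and final outputs.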
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach using low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models [19.719401865551745]
We present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis, and two speech classification tasks.
PolySpeech shows competitiveness across various tasks compared to single-task models.
arXiv Detail & Related papers (2024-06-12T01:35:46Z)
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our experiments reveal that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 out of 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks [61.3055230762097]
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning.
arXiv Detail & Related papers (2023-09-14T03:13:18Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
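Several of the related models above (notably VoxtLM and VioLA) share one mechanism: text subwords and discrete speech units are merged into a single vocabulary, with special ID tokens selecting the task or language. Below is a minimal sketch of how such a sequence might be assembled; the token names and layout are assumptions for illustration, not the actual vocabularies of those models.

    # Unified vocabulary with language ID (LID) and task ID (TID) tokens.
    LID = {"en": "<lid_en>", "zh": "<lid_zh>"}
    TID = {"asr": "<tid_asr>", "tts": "<tid_tts>", "st": "<tid_st>"}

    def build_sequence(task, language, source_tokens):
        # Prepend the task and language IDs to the tokenized input.
        # source_tokens are text subwords or discrete speech units produced
        # by an offline encoder (e.g. clustering of self-supervised features).
        return [TID[task], LID[language]] + list(source_tokens)

    units = ["<s_12>", "<s_873>", "<s_4>"]  # hypothetical discrete speech units
    print(build_sequence("asr", "en", units))
    # ['<tid_asr>', '<lid_en>', '<s_12>', '<s_873>', '<s_4>']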