SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts
- URL: http://arxiv.org/abs/2306.02207v3
- Date: Fri, 25 Aug 2023 16:10:18 GMT
- Authors: Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, Hung-yi Lee
- Abstract summary: We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
- Score: 108.04306136086807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have gained considerable attention for
Artificial Intelligence Generated Content (AIGC), particularly with the
emergence of ChatGPT. However, the direct adaptation of continuous speech to
LLMs that process discrete tokens remains an unsolved challenge, hindering the
application of LLMs for speech generation. Advanced speech LMs are around the
corner, as speech signals encapsulate a wealth of information, including
speaker and emotion, beyond textual data alone. Prompt tuning has demonstrated
notable gains in parameter efficiency and competitive performance on some
speech classification tasks. However, the extent to which prompts can
effectively elicit generation tasks from speech LMs remains an open question.
In this paper, we present pioneering research that explores the application of
prompt tuning to stimulate speech LMs for various generation tasks, within a
unified framework called SpeechGen, with around 10M trainable parameters. The
proposed unified framework holds great promise for efficiency and
effectiveness, particularly with the imminent arrival of advanced speech LMs,
which will significantly enhance the capabilities of the framework. The code
and demos of SpeechGen will be available on the project website:
https://ga642381.github.io/SpeechPrompt/speechgen
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce Generative Pre-trained Speech Transformer (GPST).
GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture.
Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought [12.54786997634534]
This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST.
We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting.
The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units.
arXiv Detail & Related papers (2024-05-30T18:28:31Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks [94.30385972442387]
We propose SpeechPrompt v2, a prompt tuning framework capable of performing a wide variety of speech classification tasks.
Experimental results show that SpeechPrompt v2 achieves performance on par with prior works with fewer than 0.15M trainable parameters.
arXiv Detail & Related papers (2023-03-01T18:47:41Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)