SpeechGen: Unlocking the Generative Power of Speech Language Models with
Prompts
- URL: http://arxiv.org/abs/2306.02207v3
- Date: Fri, 25 Aug 2023 16:10:18 GMT
- Title: SpeechGen: Unlocking the Generative Power of Speech Language Models with
Prompts
- Authors: Haibin Wu, Kai-Wei Chang, Yuan-Kuei Wu, Hung-yi Lee
- Abstract summary: We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
- Score: 108.04306136086807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have gained considerable attention for
Artificial Intelligence Generated Content (AIGC), particularly with the
emergence of ChatGPT. However, the direct adaptation of continuous speech to
LLMs that process discrete tokens remains an unsolved challenge, hindering the
application of LLMs for speech generation. The advanced speech LMs are in the
corner, as that speech signals encapsulate a wealth of information, including
speaker and emotion, beyond textual data alone. Prompt tuning has demonstrated
notable gains in parameter efficiency and competitive performance on some
speech classification tasks. However, the extent to which prompts can
effectively elicit generation tasks from speech LMs remains an open question.
In this paper, we present pioneering research that explores the application of
prompt tuning to stimulate speech LMs for various generation tasks, within a
unified framework called SpeechGen, with around 10M trainable parameters. The
proposed unified framework holds great promise for efficiency and
effectiveness, particularly with the imminent arrival of advanced speech LMs,
which will significantly enhance the capabilities of the framework. The code
and demos of SpeechGen will be available on the project website:
\url{https://ga642381.github.io/SpeechPrompt/speechgen}
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities.
Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
Large language models (LLMs) can take into account users' emotions or speaking styles when providing their responses.
In this work, we utilize an end-to-end system with a speech encoder.
We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech.
arXiv Detail & Related papers (2024-10-02T01:32:47Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.