SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
- URL: http://arxiv.org/abs/2401.13527v2
- Date: Thu, 25 Jan 2024 17:24:52 GMT
- Title: SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
- Authors: Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu
- Abstract summary: Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
- Score: 56.913182262166316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Benefiting from effective speech modeling, current Speech Large Language
Models (SLLMs) have demonstrated exceptional capabilities in in-context speech
generation and efficient generalization to unseen speakers. However, the
prevailing information modeling process is encumbered by certain redundancies,
leading to inefficiencies in speech generation. We propose Chain-of-Information
Generation (CoIG), a method for decoupling semantic and perceptual information
in large-scale speech generation. Building on this, we develop SpeechGPT-Gen,
an 8-billion-parameter SLLM efficient in semantic and perceptual information
modeling. It comprises an autoregressive model based on LLM for semantic
information modeling and a non-autoregressive model employing flow matching for
perceptual information modeling. Additionally, we introduce the novel approach
of infusing semantic information into the prior distribution to enhance the
efficiency of flow matching. Extensive experimental results demonstrate that
SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice
conversion, and speech-to-speech dialogue, underscoring CoIG's remarkable
proficiency in capturing and modeling speech's semantic and perceptual
dimensions. Code and models are available at
https://github.com/0nutation/SpeechGPT.
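The idea of infusing semantic information into the flow-matching prior can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a generic conditional flow matching setup with straight-line paths and a Gaussian prior centered on the semantic features. All function names, the `sigma` parameter, and the list-based feature vectors are hypothetical choices made for illustration.

```python
import random

def sample_prior(semantic, sigma, rng):
    # Assumption: the prior is a Gaussian centered on the semantic features,
    # rather than the usual zero-mean standard Gaussian.
    return [s + sigma * rng.gauss(0.0, 1.0) for s in semantic]

def interpolate(x0, x1, t):
    # Point at time t on the straight path from the prior sample x0 to the target x1.
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    # Velocity of the straight path; this is the regression target.
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, semantic, target, sigma, rng):
    # One Monte Carlo sample of the mean squared error between the predicted
    # and the true velocity at a random time t in [0, 1).
    x0 = sample_prior(semantic, sigma, rng)
    t = rng.random()
    xt = interpolate(x0, target, t)
    v_pred = predict_velocity(xt, t, semantic)
    v_true = target_velocity(x0, target)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(target)

def euler_generate(predict_velocity, semantic, sigma, rng, steps=10):
    # Generation: start from the semantic-centered prior and integrate the
    # predicted velocity field from t=0 to t=1 with fixed Euler steps.
    x = sample_prior(semantic, sigma, rng)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t, semantic)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Toy usage: an "oracle" velocity that already knows a single target vector.
semantic = [0.5, -1.0, 2.0]   # stand-in for semantic features
target = [0.7, -0.8, 1.5]     # stand-in for perceptual (acoustic) features

def oracle(x, t, sem):
    # For a single known target, (target - x) / (1 - t) equals x1 - x0 on the path.
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]

rng = random.Random(0)
loss = flow_matching_loss(oracle, semantic, target, 0.1, rng)
generated = euler_generate(oracle, semantic, 0.1, rng, steps=10)
```

In this sketch the prior sample already starts near the semantic features, so the path to the target is shorter than it would be from a zero-mean Gaussian. That is one plausible reading of why a semantic prior could make flow matching more efficient. In a real system `predict_velocity` would be a trained neural network, not an oracle.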
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- Scaling Properties of Speech Language Models [4.0142527158949415]
Speech Language Models (SLMs) aim to learn language from raw audio, without textual resources.
We estimate the scale at which our current methods will yield an SLM with the English proficiency of text-based Large Language Models (LLMs).
arXiv Detail & Related papers (2024-03-31T13:30:12Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
- Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling [41.733860809136196]
We propose an effective and efficient method to learn robust discrete speech representation for generative spoken language modeling.
The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme.
We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
arXiv Detail & Related papers (2022-09-30T14:15:03Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model was proposed and has achieved record-breaking success on many natural language processing tasks.
We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.