SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal
Conversational Abilities
- URL: http://arxiv.org/abs/2305.11000v2
- Date: Fri, 19 May 2023 14:41:16 GMT
- Title: SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal
Conversational Abilities
- Authors: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou,
Xipeng Qiu
- Abstract summary: SpeechGPT is a large language model with intrinsic cross-modal conversational abilities.
We employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
- Score: 39.07096632751864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal large language models are regarded as a crucial step towards
Artificial General Intelligence (AGI) and have garnered significant interest
with the emergence of ChatGPT. However, current speech-language models
typically adopt the cascade paradigm, preventing inter-modal knowledge
transfer. In this paper, we propose SpeechGPT, a large language model with
intrinsic cross-modal conversational abilities, capable of perceiving and
generating multi-modal content. With discrete speech representations, we first
construct SpeechInstruct, a large-scale cross-modal speech instruction dataset.
Additionally, we employ a three-stage training strategy that includes
modality-adaptation pre-training, cross-modal instruction fine-tuning, and
chain-of-modality instruction fine-tuning. The experimental results demonstrate
that SpeechGPT has an impressive capacity to follow multi-modal human
instructions and highlight the potential of handling multiple modalities with
one model. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.
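The abstract leaves the mechanics of the discrete speech representations and the chain-of-modality prompts implicit. The Python sketch below is a minimal, hypothetical illustration of the idea, assuming a HuBERT-style unit extractor (stubbed here with a hash over audio frames so the example runs without audio dependencies), an assumed unit vocabulary of 1000 tokens, and invented boundary tokens and helper names (`extract_units`, `units_to_tokens`, `build_chain_of_modality_prompt`); none of these are the paper's actual API.

```python
import random

NUM_UNITS = 1000  # assumed size of the discrete speech-unit vocabulary
UNIT_TOKENS = [f"<unit_{i}>" for i in range(NUM_UNITS)]
SPEECH_START, SPEECH_END = "<sosp>", "<eosp>"  # assumed speech boundary tokens


def extract_units(waveform, frame_size=320):
    """Stand-in for a HuBERT + k-means unit extractor.

    A real pipeline would encode 16 kHz audio with a self-supervised model and
    assign each frame to one of NUM_UNITS clusters; here frames are hashed to
    cluster ids so the example runs without any audio dependencies.
    """
    frames = [waveform[i:i + frame_size] for i in range(0, len(waveform), frame_size)]
    return [hash(tuple(round(x, 3) for x in frame)) % NUM_UNITS for frame in frames]


def units_to_tokens(units):
    """Render unit ids as pseudo-text that an expanded LLM tokenizer could consume."""
    # Collapse consecutive duplicates, as is common for discrete unit sequences.
    deduped = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
    return SPEECH_START + "".join(UNIT_TOKENS[u] for u in deduped) + SPEECH_END


def build_chain_of_modality_prompt(speech_tokens):
    """Hypothetical chain-of-modality prompt: the model is asked to transcribe
    the spoken question, answer it in text, and only then emit speech units."""
    return (
        f"[Human]: {speech_tokens}\n"
        "[SpeechGPT]: transcription: <...>; text answer: <...>; "
        f"spoken answer: {SPEECH_START}<...>{SPEECH_END}\n"
    )


if __name__ == "__main__":
    fake_audio = [random.uniform(-1.0, 1.0) for _ in range(16000)]  # 1 s stand-in signal
    prompt = build_chain_of_modality_prompt(units_to_tokens(extract_units(fake_audio)))
    print(prompt[:200])
```

Presumably the unit tokens are appended to the base LLM's vocabulary during modality-adaptation pre-training, so that cross-modal and chain-of-modality instruction fine-tuning can operate on a single shared token space; the abstract does not state this explicitly.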
Related papers
- SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning [43.71388370559826]
This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information.
We used large language models to generate descriptions for multi-talker speech.
Our model is pre-trained on this captioning task and then instruction-tuned.
arXiv Detail & Related papers (2024-08-25T17:05:26Z)
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Cross-Modal Mutual Learning for Cued Speech Recognition [10.225972737967249]
We propose a transformer-based cross-modal mutual learning framework to prompt multi-modal interaction.
Our model forces modality-specific information of different modalities to pass through a modality-invariant codebook.
We establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese.
arXiv Detail & Related papers (2022-12-02T10:45:33Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes lamBERT, a model for language and action learning using multimodal BERT.
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.