AudioPaLM: A Large Language Model That Can Speak and Listen
- URL: http://arxiv.org/abs/2306.12925v1
- Date: Thu, 22 Jun 2023 14:37:54 GMT
- Title: AudioPaLM: A Large Language Model That Can Speak and Listen
- Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur
Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El
Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James
Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle
Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo
Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil
Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
- Abstract summary: We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
- Score: 79.44757696533709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AudioPaLM, a large language model for speech understanding and
generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2
[Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified
multimodal architecture that can process and generate text and speech with
applications including speech recognition and speech-to-speech translation.
AudioPaLM inherits the capability to preserve paralinguistic information such
as speaker identity and intonation from AudioLM and the linguistic knowledge
present only in text large language models such as PaLM-2. We demonstrate that
initializing AudioPaLM with the weights of a text-only large language model
improves speech processing, successfully leveraging the larger quantity of text
training data used in pretraining to assist with the speech tasks. The
resulting model significantly outperforms existing systems for speech
translation tasks and has the ability to perform zero-shot speech-to-text
translation for many languages for which input/target language combinations
were not seen in training. AudioPaLM also demonstrates features of audio
language models, such as transferring a voice across languages based on a short
spoken prompt. We release examples of our method at
https://google-research.github.io/seanet/audiopalm/examples
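
The core design described in the abstract is simple to picture: discrete audio tokens are added to the vocabulary of a pretrained text-only decoder, so a single Transformer models mixed text-and-audio sequences and tasks such as speech recognition or speech-to-speech translation become ordinary next-token prediction. The following is a minimal sketch of that idea, not the released implementation; the vocabulary sizes, dimensions, class names, and decoder interface are assumptions.

```python
# Minimal sketch (not the released implementation) of the AudioPaLM idea:
# a decoder-only text LLM becomes a text-and-audio model by extending its
# token embedding matrix with rows for discrete audio tokens, so text and
# audio share one vocabulary and one Transformer decoder.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # size of the pretrained text tokenizer (assumed)
AUDIO_VOCAB = 1_024   # number of discrete audio codes (assumed)
D_MODEL = 1_536       # model width (assumed)


class JointTextAudioLM(nn.Module):
    """Decoder-only LM over a shared text + audio token vocabulary."""

    def __init__(self, decoder: nn.Module, text_embed_weight: torch.Tensor):
        super().__init__()
        # `decoder` is any causal Transformer stack mapping
        # (batch, seq, D_MODEL) -> (batch, seq, D_MODEL); in AudioPaLM this
        # role is played by the pretrained PaLM-2 decoder.
        self.decoder = decoder
        # Extended embedding table: text rows copied from the pretrained
        # model, fresh rows appended for the discrete audio tokens.
        self.embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)
        with torch.no_grad():
            self.embed.weight[:TEXT_VOCAB].copy_(text_embed_weight)
        # Output projection covers the combined vocabulary, so the model
        # can emit either a text token or an audio token at every step.
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + AUDIO_VOCAB, bias=False)

    def forward(self, token_ids: torch.LongTensor) -> torch.Tensor:
        # token_ids mix text ids in [0, TEXT_VOCAB) and audio ids offset
        # into [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB).
        hidden = self.decoder(self.embed(token_ids))
        return self.lm_head(hidden)  # next-token logits over text + audio
```

In the actual system, the audio tokens come from a separate speech tokenizer and are rendered back to waveforms by an AudioLM-style decoding stage; the sketch covers only the shared-vocabulary language-modeling step that benefits from text-only pretraining.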
Related papers
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models [13.855545744177586]
This paper examines the performance of existing audio language models in an underserved language, using Thai as a case study.
Despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities.
This paper integrates audio comprehension and speech instruction-following capabilities into a single unified model.
arXiv Detail & Related papers (2024-09-17T09:04:03Z)
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation [58.72068260933836]
The input and output of the system are multimodal (i.e., audio and visual speech).
It enables face-to-face-like conversations with individuals worldwide in a virtual meeting, with each participant speaking their own primary language.
In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, the proposed AV2AV translates directly between audio-visual speech.
arXiv Detail & Related papers (2023-12-05T05:36:44Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models that leverages spoken language audio data in a teacher-student setup.
This yields an improved language model for analyzing spoken transcripts while avoiding audio processing overhead at test time.
In our experiments, the student model achieves consistent improvements over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model; a minimal sketch of this encoder-plus-LLM coupling appears after this list.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
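
Several of the entries above (SALMONN, Speech-LLaMA) follow a different coupling pattern from AudioPaLM: instead of adding audio tokens to the vocabulary, features from a pretrained speech/audio encoder are projected into the text LLM's embedding space and consumed alongside the text prompt. The sketch below illustrates that pattern only; the connector design, dimensions, and names are illustrative assumptions rather than code from either paper.

```python
# Minimal sketch of the encoder-plus-LLM coupling referenced in the SALMONN
# and Speech-LLaMA entries above: features from a pretrained speech/audio
# encoder are mapped by a small projection into the text LLM's embedding
# space and concatenated with ordinary text prompt embeddings. All module
# names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

ENC_DIM = 1_024   # speech encoder feature size (assumed)
LLM_DIM = 4_096   # text LLM hidden size (assumed)


class SpeechToLLMAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        # A lightweight projection bridging the two representation spaces;
        # real systems may instead use Q-Former-style or convolutional
        # connectors that also downsample the frame rate.
        self.proj = nn.Sequential(
            nn.Linear(ENC_DIM, LLM_DIM),
            nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, speech_feats: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, ENC_DIM) from a frozen audio encoder
        # prompt_embeds: (batch, prompt_len, LLM_DIM) from the LLM tokenizer
        speech_embeds = self.proj(speech_feats)
        # The LLM then decodes from this mixed sequence as if it were text.
        return torch.cat([speech_embeds, prompt_embeds], dim=1)
```

Under this pattern the text LLM can be kept frozen or only lightly adapted, with the small connector carrying most of the modality-bridging work; the choice of connector is a per-system design decision.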
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.