AudioLM: a Language Modeling Approach to Audio Generation
- URL: http://arxiv.org/abs/2209.03143v2
- Date: Wed, 26 Jul 2023 03:52:36 GMT
- Title: AudioLM: a Language Modeling Approach to Audio Generation
- Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov,
Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David
Grangier, Marco Tagliasacchi, Neil Zeghidour
- Abstract summary: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
- Score: 59.19364975706805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AudioLM, a framework for high-quality audio generation with
long-term consistency. AudioLM maps the input audio to a sequence of discrete
tokens and casts audio generation as a language modeling task in this
representation space. We show how existing audio tokenizers provide different
trade-offs between reconstruction quality and long-term structure, and we
propose a hybrid tokenization scheme to achieve both objectives. Namely, we
leverage the discretized activations of a masked language model pre-trained on
audio to capture long-term structure and the discrete codes produced by a
neural audio codec to achieve high-quality synthesis. By training on large
corpora of raw audio waveforms, AudioLM learns to generate natural and coherent
continuations given short prompts. When trained on speech, and without any
transcript or annotation, AudioLM generates syntactically and semantically
plausible speech continuations while also maintaining speaker identity and
prosody for unseen speakers. Furthermore, we demonstrate how our approach
extends beyond speech by generating coherent piano music continuations, despite
being trained without any symbolic representation of music.
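As a rough illustration of the abstract's core idea (map audio to discrete tokens, then model them autoregressively in a semantic stage followed by coarse and fine acoustic stages), here is a minimal, non-authoritative sketch. The tiny Transformer and all names below are stand-ins, not the paper's models: the actual system uses w2v-BERT-derived semantic tokens, SoundStream codec tokens, and far larger language models.
```python
# Minimal sketch of AudioLM's three-stage token pipeline. All modules and
# vocabularies here are illustrative stand-ins, not the paper's components.
import torch
import torch.nn as nn

class TinyTokenLM(nn.Module):
    """Stand-in decoder-only Transformer LM over discrete tokens."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq)
        n = tokens.size(1)
        causal = torch.full((n, n), float("-inf")).triu(1)  # causal mask
        x = self.body(self.embed(tokens), mask=causal)
        return self.head(x)  # (batch, seq, vocab)

@torch.no_grad()
def sample_continuation(lm, prompt, steps):
    """Autoregressively extend a token prompt, one token per step."""
    seq = prompt
    for _ in range(steps):
        logits = lm(seq)[:, -1]                       # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq

# Stage 1: a semantic LM models long-term structure ("what is said").
semantic_lm = TinyTokenLM(vocab_size=1024)
semantic_prompt = torch.randint(0, 1024, (1, 50))     # stand-in for tokenized audio
semantic = sample_continuation(semantic_lm, semantic_prompt, steps=100)

# Stage 2: a coarse acoustic LM conditions on the semantic tokens (plain
# prefix concatenation here, a simplification of the paper's setup).
coarse = sample_continuation(TinyTokenLM(vocab_size=1024), semantic, steps=100)

# Stage 3: a fine acoustic LM adds detail; its output tokens would be
# decoded back to a waveform by the neural codec's decoder.
fine = sample_continuation(TinyTokenLM(vocab_size=1024), coarse, steps=100)
print(fine.shape)
```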
Related papers
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements.
Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer.
Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually described semantic, spatial, and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- SoundStorm: Efficient Parallel Audio Generation [27.121920017380273]
We present SoundStorm, a model for efficient, non-autoregressive audio generation.
SoundStorm receives as input the semantic tokens of AudioLM and relies on bidirectional attention and confidence-based parallel decoding (see the sketch after this list).
We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments.
arXiv Detail & Related papers (2023-05-16T17:41:25Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
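For the SoundStorm entry above, the following is a hedged sketch of confidence-based parallel decoding in the MaskGIT style that SoundStorm builds on. The token predictor here is a random stub rather than the actual bidirectional model, and the unmasking schedule is a simplification.
```python
# Hedged sketch of confidence-based parallel decoding (MaskGIT-style).
# The "model" is a random stub and all names here are illustrative.
import torch

def parallel_decode(predict_logits, seq_len, mask_id, n_rounds=8):
    """Start fully masked; each round, commit the most confident predictions."""
    tokens = torch.full((1, seq_len), mask_id)
    for step in range(n_rounds):
        probs = predict_logits(tokens).softmax(-1)    # (1, seq_len, vocab)
        conf, guess = probs.max(-1)                   # per-position confidence
        still_masked = tokens == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)  # never revisit decided slots
        # Commit a growing fraction of the remaining masked positions per round.
        k = max(1, int(still_masked.sum() * (step + 1) / n_rounds))
        top = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, top, guess.gather(1, top))
    return tokens

# Stand-in for a bidirectional token predictor: random logits, just to run.
vocab_size, mask_id = 1024, 1024
fake_model = lambda t: torch.randn(t.size(0), t.size(1), vocab_size)
print(parallel_decode(fake_model, seq_len=64, mask_id=mask_id).shape)
```
Unlike the autoregressive loop in the earlier sketch, all positions are predicted simultaneously each round, which is what makes the decoding efficient for long token sequences.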
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.