Catch-A-Waveform: Learning to Generate Audio from a Single Short Example
- URL: http://arxiv.org/abs/2106.06426v1
- Date: Fri, 11 Jun 2021 14:35:11 GMT
- Title: Catch-A-Waveform: Learning to Generate Audio from a Single Short Example
- Authors: Gal Greshler, Tamar Rott Shaham and Tomer Michaeli
- Abstract summary: We present a GAN-based generative model that can be trained on one short audio signal from any domain.
We show that in all cases, no more than 20 seconds of training audio commonly suffice for our model to achieve state-of-the-art results.
- Score: 33.96833901121411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models for audio generation are typically trained on hours of recordings.
Here, we illustrate that capturing the essence of an audio source is typically
possible from as little as a few tens of seconds from a single training signal.
Specifically, we present a GAN-based generative model that can be trained on
one short audio signal from any domain (e.g. speech, music, etc.) and does not
require pre-training or any other form of external supervision. Once trained,
our model can generate random samples of arbitrary duration that maintain
semantic similarity to the training waveform, yet exhibit new compositions of
its audio primitives. This enables a long line of interesting applications,
including generating new jazz improvisations or new a-cappella rap variants
based on a single short example, producing coherent modifications to famous
songs (e.g. adding a new verse to a Beatles song based solely on the original
recording), filling-in of missing parts (inpainting), extending the bandwidth
of a speech signal (super-resolution), and enhancing old recordings without
access to any clean training example. We show that in all cases, no more than
20 seconds of training audio commonly suffice for our model to achieve
state-of-the-art results. This is despite its complete lack of prior knowledge
about the nature of audio signals in general.
Related papers
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z) - Controllable Music Production with Diffusion Models and Guidance
Gradients [3.187381965457262]
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in 44.1kHz stereo audio.
The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips.
arXiv Detail & Related papers (2023-11-01T16:01:01Z) - Audio-Driven Dubbing for User Generated Contents via Style-Aware
Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production.
In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z) - AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA)
arXiv Detail & Related papers (2023-08-10T17:55:13Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - ArchiSound: Audio Generation with Diffusion [0.0]
In this work, we investigate the potential of diffusion models for audio generation.
We propose a new method for text-conditional latent audio diffusion with stacked 1D U-Nets.
For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU.
arXiv Detail & Related papers (2023-01-30T20:23:26Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AaudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.