SpecSinGAN: Sound Effect Variation Synthesis Using Single-Image GANs
- URL: http://arxiv.org/abs/2110.07311v1
- Date: Thu, 14 Oct 2021 12:25:52 GMT
- Title: SpecSinGAN: Sound Effect Variation Synthesis Using Single-Image GANs
- Authors: Adrián Barahona-Ríos, Tom Collins
- Abstract summary: Single-image generative adversarial networks learn from the internal distribution of a single training example to generate variations of it.
SpecSinGAN takes a single one-shot sound effect and produces novel variations of it, as if they were different takes from the same recording session.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-image generative adversarial networks learn from the internal
distribution of a single training example to generate variations of it,
removing the need for a large dataset. In this paper we introduce SpecSinGAN, an
unconditional generative architecture that takes a single one-shot sound effect
(e.g., a footstep; a character jump) and produces novel variations of it, as if
they were different takes from the same recording session. We explore the use
of multi-channel spectrograms to train the model on the various layers that
comprise a single sound effect. A listening study comparing our model to real
recordings and to digital signal processing procedural audio models in terms of
sound plausibility and variation revealed that SpecSinGAN is more plausible and
varied than the procedural audio models considered, when using multi-channel
spectrograms. Sound examples can be found at the project website:
https://www.adrianbarahonarios.com/specsingan/
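The abstract states only that the model trains on multi-channel spectrograms whose channels correspond to the layers that make up a single sound effect; it does not give the exact representation. The sketch below is one possible way to assemble such an input, under stated assumptions (hypothetical layer file names, illustrative STFT settings), using librosa and NumPy:

```python
# Minimal sketch, NOT the paper's exact pipeline: stack per-layer log-magnitude
# spectrograms of a one-shot sound effect into one multi-channel "image" that a
# single-image GAN could train on. File names and STFT settings are assumptions.
import numpy as np
import librosa

LAYER_FILES = ["footstep_heel.wav", "footstep_toe.wav"]  # hypothetical layer files
SR, N_FFT, HOP = 22050, 1024, 256                        # assumed STFT settings

def layer_spectrogram(path):
    """Log-magnitude spectrogram of one layer of the sound effect."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    s = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    return np.log1p(np.abs(s))

# Pad every layer to the longest one, then stack as channels: (layers, freq, frames).
specs = [layer_spectrogram(p) for p in LAYER_FILES]
max_frames = max(s.shape[1] for s in specs)
specs = [np.pad(s, ((0, 0), (0, max_frames - s.shape[1]))) for s in specs]
multi_channel = np.stack(specs, axis=0)
print(multi_channel.shape)  # e.g. (2, 513, T): one training example for the GAN
```

Converting generated spectrograms back to audio would additionally require phase reconstruction (e.g., Griffin-Lim) or a vocoder, which this sketch does not cover.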
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method [67.24600975813419]
We propose a convolution layer capable of handling arbitrary sampling frequencies by a single deep neural network.
We show that introducing the proposed layer enables a conventional audio source separation model to work consistently even with unseen sampling frequencies (a minimal sketch of the impulse invariant discretization it builds on appears after this list).
arXiv Detail & Related papers (2021-05-10T02:33:42Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- MTCRNN: A multi-scale RNN for directed audio texture synthesis [0.0]
We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis.
We demonstrate the model on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.
arXiv Detail & Related papers (2020-11-25T09:13:53Z)
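The sampling-frequency-independent source separation entry above builds its convolution layer on the impulse invariant method. The sketch below illustrates only that underlying discretization, not the paper's layer: one continuous-time filter prototype (illustrative values) is discretized at two sampling rates with SciPy, yielding a finite convolution kernel per rate.

```python
# Minimal sketch of impulse-invariant discretization, NOT the paper's layer:
# the same analog filter prototype yields a convolution kernel for each
# sampling frequency. Filter values below are purely illustrative.
import numpy as np
from scipy import signal

# Analog second-order low-pass prototype, cutoff ~2 kHz (illustrative).
wc = 2 * np.pi * 2000.0
num, den = [wc**2], [1.0, np.sqrt(2.0) * wc, wc**2]
A, B, C, D = signal.tf2ss(num, den)

for fs in (16000, 44100):  # two different sampling frequencies
    # Impulse-invariant discretization of the same continuous-time filter.
    Ad, Bd, Cd, Dd, _ = signal.cont2discrete((A, B, C, D), dt=1.0 / fs,
                                              method="impulse")
    # Truncate the discrete impulse response to get a kernel for this rate.
    _, h = signal.dimpulse((Ad, Bd, Cd, Dd, 1.0 / fs), n=64)
    kernel = np.squeeze(h[0])
    print(fs, kernel[:4])
```

The idea the title points to is that kernels derived from a shared continuous-time parameterization let a single trained network be instantiated at arbitrary sampling frequencies; the analog prototype used here is purely illustrative.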