Sound Model Factory: An Integrated System Architecture for Generative
Audio Modelling
- URL: http://arxiv.org/abs/2206.13085v1
- Date: Mon, 27 Jun 2022 07:10:22 GMT
- Title: Sound Model Factory: An Integrated System Architecture for Generative
Audio Modelling
- Authors: Lonce Wyse, Purnima Kamath, Chitralekha Gupta
- Abstract summary: We introduce a new system for data-driven audio sound model design built around two different neural network architectures.
The objective of the system is to generate interactively controllable sound models given (a) a range of sounds the model should be able to synthesize, and (b) a specification of the parametric controls for navigating that space of sounds.
- Score: 4.193940401637568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new system for data-driven audio sound model design built
around two different neural network architectures, a Generative Adversarial
Network (GAN) and a Recurrent Neural Network (RNN), that takes advantage of the
unique characteristics of each to achieve the system objectives that neither is
capable of addressing alone. The objective of the system is to generate
interactively controllable sound models given (a) a range of sounds the model
should be able to synthesize, and (b) a specification of the parametric
controls for navigating that space of sounds. The range of sounds is defined by
a dataset provided by the designer, while the means of navigation is defined by
a combination of data labels and the selection of a sub-manifold from the
latent space learned by the GAN. Our proposed system takes advantage of the
rich latent space of a GAN that consists of sounds that fill out the spaces
''between" real data-like sounds. This augmented data from the GAN is then used
to train an RNN for its ability to respond immediately and continuously to
parameter changes and to generate audio over unlimited periods of time.
Furthermore, we develop a self-organizing map technique for "smoothing" the
latent space of the GAN that results in perceptually smooth interpolation between
audio timbres. We validate this process through user studies. The system
contributes advances to the state of the art for generative sound model design
that include system configuration and components for improving interpolation
and the expansion of audio modeling capabilities beyond musical pitch and
percussive instrument sounds into the more complex space of audio textures.
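To make the two-stage design concrete, the following is a minimal, hypothetical sketch of the pipeline the abstract describes: control settings are mapped onto a sub-manifold of a trained GAN's latent space, the GAN synthesizes augmented audio for those points, and an RNN is then trained to reproduce that audio conditioned on the control parameters. All class names, shapes, and the toy generator are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline: (1) map control settings onto a
# sub-manifold of a trained GAN's latent space, (2) synthesize augmented audio
# with the GAN, (3) train an RNN conditioned on the control parameters.
# ToyGenerator stands in for the trained GAN; all shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, N_CTRL, AUDIO_LEN = 128, 2, 2048
ANCHORS = torch.randn(3, LATENT_DIM)  # origin + two (assumed) designer-chosen directions

class ToyGenerator(nn.Module):
    """Stand-in for a trained GAN generator: latent vector -> audio clip."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, AUDIO_LEN)

    def forward(self, z):
        return torch.tanh(self.net(z))

class ControlledRNN(nn.Module):
    """RNN synthesizer conditioned on control parameters, sample by sample."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=N_CTRL + 1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, ctrl, prev_audio):
        # ctrl: (B, T, N_CTRL); prev_audio: (B, T, 1) previous samples (teacher forcing)
        h, _ = self.rnn(torch.cat([ctrl, prev_audio], dim=-1))
        return self.out(h)

def sample_submanifold(n):
    """Map n control settings in [0, 1]^2 onto a fixed plane in latent space."""
    ctrl = torch.rand(n, N_CTRL)
    z = ANCHORS[0] + ctrl[:, :1] * ANCHORS[1] + ctrl[:, 1:] * ANCHORS[2]
    return ctrl, z

gan, rnn = ToyGenerator(), ControlledRNN()
opt = torch.optim.Adam(rnn.parameters(), lr=1e-3)

for step in range(10):
    ctrl, z = sample_submanifold(8)
    with torch.no_grad():
        audio = gan(z)                           # (B, AUDIO_LEN): GAN-augmented data
    target = audio.unsqueeze(-1)                 # (B, T, 1)
    prev = torch.roll(target, shifts=1, dims=1)  # previous-sample input (wraps at t=0)
    ctrl_seq = ctrl.unsqueeze(1).expand(-1, AUDIO_LEN, -1)
    loss = F.mse_loss(rnn(ctrl_seq, prev), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

At synthesis time, the RNN alone is needed: it consumes a stream of control values and its own previous output, so parameter changes take effect immediately and audio can run indefinitely.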
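The self-organizing map "smoothing" step could, in spirit, look like the sketch below: a small 1-D SOM is fit to latent vectors sampled from the GAN prior, and its ordered codebook then serves as a smoother path to interpolate along. The grid size, neighbourhood schedule, and NumPy implementation are assumptions for illustration only, not the paper's exact procedure.

```python
# Hypothetical 1-D SOM "smoothing" of GAN latent vectors: fit a small ordered
# codebook to latents sampled from the GAN prior; neighbouring codebook nodes
# then serve as interpolation targets for a perceptually smoother control axis.
import numpy as np

def train_som_path(latents, n_nodes=16, iters=2000, lr0=0.5, sigma0=4.0):
    """Fit a 1-D self-organizing map to latent vectors; return ordered nodes."""
    rng = np.random.default_rng(0)
    dim = latents.shape[1]
    nodes = rng.standard_normal((n_nodes, dim)) * latents.std()
    for t in range(iters):
        x = latents[rng.integers(len(latents))]
        frac = t / iters
        lr, sigma = lr0 * (1 - frac), max(sigma0 * (1 - frac), 0.5)
        bmu = int(np.argmin(np.linalg.norm(nodes - x, axis=1)))  # best-matching unit
        grid_dist = np.abs(np.arange(n_nodes) - bmu)              # distance on 1-D grid
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))          # neighbourhood weights
        nodes += lr * h[:, None] * (x - nodes)
    return nodes

# Usage sketch: latents drawn from an (assumed) Gaussian GAN prior; a control
# value c in [0, 1] would then index or interpolate along the returned path.
latents = np.random.randn(500, 128)
path = train_som_path(latents)
```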
Related papers
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - DiffMoog: a Differentiable Modular Synthesizer for Sound Matching [48.33168531500444]
DiffMoog is a differentiable modular synthesizer with a comprehensive set of modules typically found in commercial instruments.
Being differentiable, it allows integration into neural networks, enabling automated sound matching.
We introduce an open-source platform that comprises DiffMoog and an end-to-end sound matching framework.
arXiv Detail & Related papers (2024-01-23T08:59:21Z) - Audio-Visual Speech Separation in Noisy Environments with a Lightweight
Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z) - A General Framework for Learning Procedural Audio Models of
Environmental Sounds [7.478290484139404]
This paper introduces the Procedural (audio) Variational autoEncoder (ProVE) framework as a general approach to learning Procedural Audio (PA) models.
We show that ProVE models outperform both classical PA models and an adversarial-based approach in terms of sound fidelity.
arXiv Detail & Related papers (2023-03-04T12:12:26Z) - Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z) - BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for
Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System
Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - MTCRNN: A multi-scale RNN for directed audio texture synthesis [0.0]
We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis.
We demonstrate the model's performance on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.
arXiv Detail & Related papers (2020-11-25T09:13:53Z) - Deep generative models for musical audio synthesis [0.0]
Sound modelling is the process of developing algorithms that generate sound under parametric control.
Recent generative deep learning systems for audio synthesis are able to learn models that can traverse arbitrary spaces of sound.
This paper is a review of developments in deep learning that are changing the practice of sound modelling.
arXiv Detail & Related papers (2020-06-10T04:02:42Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)