A General Framework for Learning Procedural Audio Models of
Environmental Sounds
- URL: http://arxiv.org/abs/2303.02396v1
- Date: Sat, 4 Mar 2023 12:12:26 GMT
- Title: A General Framework for Learning Procedural Audio Models of
Environmental Sounds
- Authors: Danzel Serrano and Mark Cartwright
- Abstract summary: This paper introduces the Procedural (audio) Variational autoEncoder (ProVE) framework as a general approach to learning Procedural Audio (PA) models.
We show that ProVE models outperform both classical PA models and an adversarial-based approach in terms of sound fidelity.
- Score: 7.478290484139404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces the Procedural (audio) Variational autoEncoder (ProVE)
framework as a general approach to learning Procedural Audio (PA) models of
environmental sounds that improves the realism of the synthesis while
retaining control over the generated sound through adjustable parameters.
The framework comprises two stages: (i) Audio Class Representation,
in which a latent representation space is defined by training an audio
autoencoder, and (ii) Control Mapping, in which a joint function of
static/temporal control variables derived from the audio and a random sample of
uniform noise is learned to replace the audio encoder. We demonstrate the use
of ProVE through the example of footstep sound effects on various surfaces. Our
results show that ProVE models outperform both classical PA models and an
adversarial-based approach in terms of sound fidelity, as measured by Fréchet
Audio Distance (FAD), Maximum Mean Discrepancy (MMD), and subjective
evaluations, making them feasible tools for sound design workflows.
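The two-stage design lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the pipeline described in the abstract; the mel-spectrogram framing, module names, layer sizes, and control variables (e.g. surface type, pace) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Stage (i), Audio Class Representation: define a latent space by
    training an autoencoder on recordings of the target sound class."""
    def __init__(self, n_mels=80, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, n_mels))

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        z = self.encoder(mel)
        return self.decoder(z), z

class ControlMapper(nn.Module):
    """Stage (ii), Control Mapping: a joint function of control variables
    and a sample of uniform noise, learned to replace the audio encoder."""
    def __init__(self, n_controls=4, noise_dim=8, latent_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(n_controls + noise_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim))

    def forward(self, controls):  # controls: (batch, frames, n_controls)
        u = torch.rand(*controls.shape[:-1], self.noise_dim)  # uniform noise
        return self.net(torch.cat([controls, u], dim=-1))

ae = AudioAutoencoder()
# ... stage (i): fit `ae` with a reconstruction loss on mel frames ...

mapper = ControlMapper()
mel = torch.randn(2, 100, 80)      # stand-in mel frames of, e.g., footsteps
controls = torch.rand(2, 100, 4)   # assumed controls, e.g. surface and pace
with torch.no_grad():              # stage (ii): the autoencoder stays fixed
    z_target = ae.encoder(mel)
# One plausible objective: make the mapper's latents match the encoder's.
loss = nn.functional.mse_loss(mapper(controls), z_target)
# Synthesis: mel_hat = ae.decoder(mapper(user_controls))
```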
Related papers
- Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance [42.90024643696503]
We present an end-to-end learning solution that jointly optimises the models for audio enhancement and the downstream computer audition tasks.
We consider four representative applications to evaluate our training paradigm.
arXiv Detail & Related papers (2024-08-12T16:23:58Z)
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
More precisely, the proposed framework consists of two steps: first, estimating audio-only and video-only systems, and then designing a tailored audio-visual unified encoder.
Results show that the tailored AVSR system reaches state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the tasks of audio understanding, video-to-audio generation, and text-to-audio generation into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
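The entry above describes recovering event boundaries by reversing a noising process. The toy PyTorch step below illustrates that general idea only; the noise schedule, the (onset, offset) encoding, and the crude timestep feature are assumptions for a self-contained example, not the DiffSED implementation.

```python
import torch
import torch.nn as nn

T = 100                                  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)    # standard linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Tiny denoiser: predicts the added noise from (noisy boundaries, timestep).
denoiser = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

x0 = torch.tensor([[0.20, 0.35]])        # ground-truth (onset, offset) in [0,1]
t = torch.randint(0, T, (1,))            # random timestep
noise = torch.randn_like(x0)
x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
t_feat = t.float().unsqueeze(-1) / T     # crude timestep feature
pred = denoiser(torch.cat([x_t, t_feat], dim=-1))
loss = nn.functional.mse_loss(pred, noise)  # learn to reverse the noising
```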
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in intervened noisy speech using a noise detector and assigns the two sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
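As a rough, self-contained illustration of the detector-plus-two-EMs design above, the PyTorch sketch below blends two mask-based enhancement modules by the detector's frame-level noise probability; the soft blending (standing in for hard frame assignment) and all module shapes are assumptions, not the CISE implementation.

```python
import torch
import torch.nn as nn

n_freq = 257  # e.g. 512-point STFT magnitude bins (assumed)

# Frame-level noise detector: p(noise present) for each frame.
detector = nn.Sequential(
    nn.Linear(n_freq, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
# Two mask-based enhancement modules (EMs).
em_clean = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())
em_noisy = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())

spec = torch.rand(1, 200, n_freq)        # magnitude spectrogram (B, T, F)
p_noise = detector(spec)                 # (B, T, 1)
mask = p_noise * em_noisy(spec) + (1 - p_noise) * em_clean(spec)
enhanced = mask * spec                   # noise-conditional enhancement
```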
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
- Deep generative models for musical audio synthesis [0.0]
Sound modelling is the process of developing algorithms that generate sound under parametric control.
Recent generative deep learning systems for audio synthesis are able to learn models that can traverse arbitrary spaces of sound.
This paper is a review of developments in deep learning that are changing the practice of sound modelling.
arXiv Detail & Related papers (2020-06-10T04:02:42Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
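To make the CVAE idea above concrete, here is a minimal pitch-conditioned CVAE sketch in PyTorch; the parametric representation is abstracted to a plain feature vector, and all dimensions are illustrative assumptions rather than the VaPar Synth architecture.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Condition both encoder and decoder on pitch for controllable tones."""
    def __init__(self, feat_dim=64, cond_dim=1, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(feat_dim + cond_dim, 2 * latent_dim)
        self.dec = nn.Linear(latent_dim + cond_dim, feat_dim)

    def forward(self, x, pitch):
        mu, logvar = self.enc(torch.cat([x, pitch], -1)).chunk(2, -1)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterize
        return self.dec(torch.cat([z, pitch], -1)), mu, logvar

model = CVAE()
x = torch.randn(4, 64)   # stand-in parametric frame features
pitch = torch.rand(4, 1) # normalized pitch condition
recon, mu, logvar = model(x, pitch)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
loss = nn.functional.mse_loss(recon, x) + kl
# At generation time, sample z ~ N(0, I) and vary `pitch` to steer the tone.
```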
This list is automatically generated from the titles and abstracts of the papers on this site.