Audio representations for deep learning in sound synthesis: A review
- URL: http://arxiv.org/abs/2201.02490v1
- Date: Fri, 7 Jan 2022 15:08:47 GMT
- Title: Audio representations for deep learning in sound synthesis: A review
- Authors: Anastasia Natsiou and Sean O'Leary
- Abstract summary: This paper provides an overview of audio representations applied to sound synthesis using deep learning.
It also presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of deep learning algorithms has led many researchers to move
away from classic signal processing methods for sound generation. Deep
learning models have achieved expressive voice synthesis, realistic sound
textures, and musical notes from virtual instruments. However, the most
suitable deep learning architecture is still under investigation. The choice of
architecture is tightly coupled to the audio representations. A sound's
original waveform can be too dense and rich for deep learning models to deal
with efficiently - and complexity increases training time and computational
cost. Also, it does not represent sound in the manner in which it is perceived.
Therefore, in many cases, the raw audio has been transformed into a compressed
and more meaningful form using downsampling, feature extraction, or even by
adopting a higher-level representation of the waveform. Furthermore, depending
on the form chosen, additional conditioning representations, different model
architectures, and numerous metrics for evaluating the reconstructed sound have
been investigated. This paper provides an overview of audio representations
applied to sound synthesis using deep learning. Additionally, it presents the
most significant methods for developing and evaluating a sound synthesis
architecture using deep learning models, always depending on the audio
representation.
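As a minimal, hedged illustration of the kind of transformation described above (not taken from the paper itself), the Python sketch below converts a raw waveform into a log-mel spectrogram, one of the compressed, perceptually motivated representations commonly used as input to sound-synthesis models. The file path and all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: converts raw audio into a log-mel spectrogram,
# one of the compressed, perception-oriented representations surveyed above.
# The file path and all parameter values are assumptions, not taken from the paper.
import librosa
import numpy as np

def waveform_to_log_mel(path: str,
                        sr: int = 22050,       # target sampling rate (assumed)
                        n_fft: int = 2048,     # STFT window size
                        hop_length: int = 512, # hop between analysis frames
                        n_mels: int = 128):    # number of mel bands
    """Load a mono waveform and return a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Map power to decibels so the representation better matches perception.
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)
```

For a one-second clip at 22.05 kHz, this reduces 22,050 raw samples to roughly 128 x 43 values, which is one reason time-frequency representations remain popular model inputs despite requiring a separate reconstruction (vocoding) step.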
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Contrastive Learning from Synthetic Audio Doppelgangers [1.3754952818114714]
We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio.
By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers: synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes.
Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks.
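As a hedged toy sketch of this idea, the snippet below renders two clips from a simple FM synthesizer whose parameters differ only by a small random jitter, yielding a positive pair for contrastive training. The synthesizer and the parameter ranges are assumptions for illustration, not the generator used in the cited paper.

```python
# Hedged toy sketch of the "audio doppelgaenger" idea: render two clips from a
# simple FM synthesizer whose parameters differ only by a small random jitter,
# yielding a positive pair for contrastive training. The synthesizer and the
# parameter ranges are illustrative assumptions, not the cited paper's generator.
import numpy as np

SR = 16000  # assumed sampling rate

def fm_tone(carrier_hz, mod_hz, mod_index, decay, dur=1.0, sr=SR):
    """Render a decaying FM tone from four scalar synthesis parameters."""
    t = np.arange(int(dur * sr)) / sr
    return np.exp(-decay * t) * np.sin(2 * np.pi * carrier_hz * t
                                       + mod_index * np.sin(2 * np.pi * mod_hz * t))

def doppelgaenger_pair(rng, jitter=0.05):
    """Sample base parameters, then perturb them slightly for the second view."""
    base = {"carrier_hz": rng.uniform(100, 1000),
            "mod_hz": rng.uniform(50, 500),
            "mod_index": rng.uniform(0.5, 5.0),
            "decay": rng.uniform(0.5, 4.0)}
    near = {k: v * (1 + jitter * rng.standard_normal()) for k, v in base.items()}
    return fm_tone(**base), fm_tone(**near)

rng = np.random.default_rng(0)
x1, x2 = doppelgaenger_pair(rng)  # two clips with nearby timbre, pitch, and envelope
```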
arXiv Detail & Related papers (2024-06-09T21:44:06Z) - Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms [2.3204178451683264]
In audio processing applications, there is a high demand for the generation of expressive sounds based on high-level representations.
Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument compression.
This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch.
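The sketch below is a minimal stacked convolutional autoencoder for log-mel spectrogram patches, written in PyTorch. The layer widths, strides, and input shape are illustrative assumptions and do not reproduce the architecture investigated in the cited paper.

```python
# Minimal sketch, assuming log-mel input patches of shape (batch, 1, 128, 64);
# layer widths are illustrative, not the architecture studied in the paper.
import torch
import torch.nn as nn

class SpectrogramAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: three strided convolutions progressively compress the
        # time-frequency resolution into a smaller latent map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions mirror the encoder back to input size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed latent representation
        return self.decoder(z)   # reconstructed log-mel spectrogram

model = SpectrogramAE()
x = torch.randn(8, 1, 128, 64)              # dummy batch of log-mel patches
loss = nn.functional.mse_loss(model(x), x)  # simple reconstruction objective
```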
arXiv Detail & Related papers (2023-01-18T17:19:04Z) - Rigid-Body Sound Synthesis with Differentiable Modal Resonators [6.680437329908454]
We present a novel end-to-end framework for training a deep neural network to generate modal resonators for a given 2D shape and material.
We demonstrate our method on a dataset of synthetic objects, but train our model using an audio-domain objective.
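For readers unfamiliar with the term, a modal resonator approximates a rigid-body impact as a bank of exponentially decaying sinusoids. The sketch below is a hedged, hand-parameterized toy version of that model; the frequencies, decays, and gains are assumptions, whereas the cited paper trains a network to predict such parameters from shape and material.

```python
# Hedged sketch of modal synthesis: an impact sound is approximated by a bank of
# exponentially decaying sinusoids (the "modes"). The mode frequencies, decay
# rates, and gains below are illustrative assumptions, not learned values.
import numpy as np

def modal_impact(freqs_hz, decays, gains, dur=1.0, sr=16000):
    """Sum of damped sinusoids: the classic modal model of an impact sound."""
    t = np.arange(int(dur * sr)) / sr
    modes = [g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
             for f, d, g in zip(freqs_hz, decays, gains)]
    return np.sum(modes, axis=0)

# A toy three-mode resonator; a neural network would predict these parameters
# from the object's geometry and material instead of hard-coding them.
y = modal_impact(freqs_hz=[220.0, 580.0, 1310.0],
                 decays=[3.0, 6.0, 12.0],
                 gains=[1.0, 0.6, 0.3])
```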
arXiv Detail & Related papers (2022-10-27T10:34:38Z) - Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z) - Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z) - COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z) - Deep generative models for musical audio synthesis [0.0]
Sound modelling is the process of developing algorithms that generate sound under parametric control.
Recent generative deep learning systems for audio synthesis are able to learn models that can traverse arbitrary spaces of sound.
This paper is a review of developments in deep learning that are changing the practice of sound modelling.
arXiv Detail & Related papers (2020-06-10T04:02:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.