Differentiable Signal Processing With Black-Box Audio Effects
- URL: http://arxiv.org/abs/2105.04752v1
- Date: Tue, 11 May 2021 02:20:22 GMT
- Title: Differentiable Signal Processing With Black-Box Audio Effects
- Authors: Marco A. Martínez Ramírez, Oliver Wang, Paris Smaragdis, Nicholas J. Bryan
- Abstract summary: We present a data-driven approach to automate audio signal processing by incorporating stateful, third-party audio effects as layers within a deep neural network.
We show that our approach can yield results comparable to a specialized, state-of-the-art commercial solution for music mastering.
- Score: 44.93154498647659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a data-driven approach to automate audio signal processing by
incorporating stateful, third-party audio effects as layers within a deep
neural network. We then train a deep encoder to analyze input audio and control
effect parameters to perform the desired signal manipulation, requiring only
input-target paired audio data as supervision. To train our network with
non-differentiable black-box effects layers, we use a fast, parallel stochastic
gradient approximation scheme within a standard automatic differentiation graph,
yielding efficient end-to-end backpropagation. We demonstrate the power of our
approach with three separate automatic audio production applications: tube
amplifier emulation, automatic removal of breaths and pops from voice
recordings, and automatic music mastering. We validate our results with a
subjective listening test, showing our approach not only can enable new
automatic audio effects tasks, but can yield results comparable to a
specialized, state-of-the-art commercial solution for music mastering.
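The fast, parallel stochastic gradient approximation the abstract describes can be illustrated with a simultaneous-perturbation (SPSA-style) estimator, which needs only two black-box evaluations per sample regardless of parameter dimensionality. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function and parameter names are hypothetical.

```python
import numpy as np

def spsa_gradient(f, theta, c=1e-2, n_samples=4, rng=None):
    """Estimate the gradient of a non-differentiable black-box
    function f at theta via simultaneous perturbation (SPSA).
    Each sample costs two evaluations of f, and samples are
    independent, so they can be evaluated in parallel."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        # Rademacher perturbation: every coordinate is +1 or -1.
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        f_plus = f(theta + c * delta)
        f_minus = f(theta - c * delta)
        # Central difference along delta, redistributed per coordinate.
        grad += (f_plus - f_minus) / (2.0 * c) * delta
    return grad / n_samples

# Toy "black-box effect" loss: we can evaluate it but not differentiate it.
loss = lambda p: np.sum((p - 1.0) ** 2)
theta = np.zeros(3)
g = spsa_gradient(loss, theta)
# The true gradient at theta is [-2, -2, -2]; g is a noisy estimate
# whose accuracy improves as n_samples grows.
```

In the paper's setting, `f` would wrap the third-party effect plus a loss against the target audio, and the estimated gradient would flow into the rest of the autodiff graph to train the encoder end-to-end.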
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
- Human Voice Pitch Estimation: A Convolutional Network with Auto-Labeled and Synthetic Data [0.0]
We present a specialized convolutional neural network designed for pitch extraction.
Our approach combines synthetic data with auto-labeled a cappella singing audio, creating a robust training environment.
This work paves the way for enhanced pitch extraction in both music and voice settings.
arXiv Detail & Related papers (2023-08-14T14:26:52Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Modulation Extraction for LFO-driven Audio Effects [5.740770499256802]
We propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations.
We show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects.
We make our code available and provide the trained audio effect models in a real-time VST plugin.
arXiv Detail & Related papers (2023-05-22T17:33:07Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects [23.29395422386749]
We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song.
This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a reference music recording.
arXiv Detail & Related papers (2022-11-04T03:45:17Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Removing Distortion Effects in Music Using Deep Neural Networks [12.497836634060569]
This paper focuses on removing distortion and clipping applied to guitar tracks for music production.
It presents a comparative investigation of different deep neural network (DNN) architectures on this task.
We achieve exceptionally good results in distortion removal using DNNs for effects that superimpose the clean signal onto the distorted signal.
arXiv Detail & Related papers (2022-02-03T16:26:29Z)
- Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in real-time speech generation tasks.
We propose audio dequantization methods in flow-based neural vocoder for high fidelity audio generation.
arXiv Detail & Related papers (2020-08-16T09:37:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.