Modelling black-box audio effects with time-varying feature modulation
- URL: http://arxiv.org/abs/2211.00497v2
- Date: Tue, 9 May 2023 19:42:06 GMT
- Title: Modelling black-box audio effects with time-varying feature modulation
- Authors: Marco Comunità, Christian J. Steinmetz, Huy Phan, Joshua D. Reiss
- Abstract summary: We show that scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression.
We propose the integration of time-varying feature-wise linear modulation into existing temporal convolutional backbones.
We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics.
- Score: 13.378050193507907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning approaches for black-box modelling of audio effects have shown
promise; however, the majority of existing work focuses on nonlinear effects
with behaviour on relatively short time-scales, such as guitar amplifiers and
distortion. While recurrent and convolutional architectures can theoretically
be extended to capture behaviour at longer time scales, we show that simply
scaling the width, depth, or dilation factor of existing architectures does not
result in satisfactory performance when modelling audio effects such as fuzz
and dynamic range compression. To address this, we propose the integration of
time-varying feature-wise linear modulation into existing temporal
convolutional backbones, an approach that enables learnable adaptation of the
intermediate activations. We demonstrate that our approach more accurately
captures long-range dependencies for a range of fuzz and compressor
implementations across both time and frequency domain metrics. We provide sound
examples, source code, and pretrained models to facilitate reproducibility.
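The core idea of the abstract — adapting a temporal convolutional network's intermediate activations with time-varying feature-wise linear modulation (FiLM) — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the paper derives the per-block scale and shift from a learned recurrent controller, whereas here a hypothetical linear map of block-pooled features stands in for it, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def tfilm(x, w_g, b_g, w_b, b_b, block_size):
    """Time-varying FiLM: scale and shift activations with per-block
    affine parameters (gamma, beta).

    x: (channels, time) intermediate activations from a TCN layer.
    Here gamma/beta come from a linear map of block-pooled features;
    the paper instead uses a recurrent controller over the blocks.
    """
    C, T = x.shape
    n_blocks = T // block_size
    y = x[:, :n_blocks * block_size].reshape(C, n_blocks, block_size)
    pooled = y.max(axis=2)                # (C, n_blocks) block summary
    gamma = w_g @ pooled + b_g[:, None]   # per-block scale, (C, n_blocks)
    beta = w_b @ pooled + b_b[:, None]    # per-block shift, (C, n_blocks)
    mod = gamma[:, :, None] * y + beta[:, :, None]
    return mod.reshape(C, n_blocks * block_size)

C, T, B = 4, 64, 16
x = rng.standard_normal((C, T))
w_g = rng.standard_normal((C, C)) * 0.1  # placeholder weights
w_b = rng.standard_normal((C, C)) * 0.1
b_g, b_b = np.ones(C), np.zeros(C)
y = tfilm(x, w_g, b_g, w_b, b_b, B)
print(y.shape)  # → (4, 64)
```

Because gamma and beta differ per block, the same channel is transformed differently at different points in time, which is what lets the network track long-range state such as a compressor's gain envelope.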
Related papers
- Comparative Study of State-based Neural Networks for Virtual Analog Audio Effects Modeling [0.0]
This article explores the application of machine learning advancements for virtual analog modeling.
We compare State-Space models and Linear Recurrent Units against the more common Long Short-Term Memory networks.
arXiv Detail & Related papers (2024-05-07T08:47:40Z) - TSLANet: Rethinking Transformers for Time Series Representation Learning [19.795353886621715]
Time series data is characterized by its intrinsic long and short-range dependencies.
We introduce a novel Time Series Lightweight Network (TSLANet) as a universal convolutional model for diverse time series tasks.
Our experiments demonstrate that TSLANet outperforms state-of-the-art models in various tasks spanning classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2024-04-12T13:41:29Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - Differentiable Grey-box Modelling of Phaser Effects using Frame-based Spectral Processing [21.053861381437827]
This work presents a differentiable digital signal processing approach to modelling phaser effects.
The proposed model processes audio in short frames to implement a time-varying filter in the frequency domain.
We show that the model can be trained to emulate an analog reference device, while retaining interpretable and adjustable parameters.
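The frame-based approach this entry describes — processing audio in short windowed frames and applying a different filter response to each frame's spectrum — can be sketched with a standard overlap-add loop. This is a generic illustration under assumed parameters, not the paper's differentiable model: the `freq_response` callback and the swept phase curve below are hypothetical stand-ins for the learned, interpretable filter parameters.

```python
import numpy as np

def framewise_filter(x, win, hop, freq_response):
    """Overlap-add processing with a time-varying frequency-domain filter.

    freq_response(i) returns a complex response of length win // 2 + 1
    for frame i, so the filter can change from frame to frame.
    """
    window = np.hanning(win)
    out = np.zeros(len(x))
    for i, start in enumerate(range(0, len(x) - win + 1, hop)):
        frame = x[start:start + win] * window
        spec = np.fft.rfft(frame) * freq_response(i)  # per-frame filtering
        out[start:start + win] += np.fft.irfft(spec, n=win) * window
    return out

fs, win, hop = 16000, 512, 256
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)

# Hypothetical phaser-like sweep: a phase-only (allpass-style) response
# whose slope oscillates slowly across frames.
def resp(i):
    phase = np.linspace(0.0, np.pi, win // 2 + 1) * np.sin(0.05 * i)
    return np.exp(1j * phase)

y = framewise_filter(x, win, hop, resp)
print(y.shape)
```

With a 50% hop and a Hann window, the squared-window overlap is approximately constant, so the frames recombine smoothly; in a differentiable setting, the per-frame responses would be produced by trainable parameters optimised against the analog reference.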
arXiv Detail & Related papers (2023-06-02T07:53:41Z) - On Compressing Sequences for Self-Supervised Speech Models [78.62210521316081]
We study fixed-length and variable-length subsampling along the time axis in self-supervised learning.
We find that variable-length subsampling performs particularly well under low frame rates.
If we have access to phonetic boundaries, we find no degradation in performance for an average frame rate as low as 10 Hz.
arXiv Detail & Related papers (2022-10-13T17:10:02Z) - Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method for producing non-causal streaming models.
This makes any convolutional model compatible with real-time buffer-based processing.
We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z) - Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z) - Exploring Quality and Generalizability in Parameterized Neural Audio Effects [0.0]
Deep neural networks have shown promise for music audio signal processing applications.
Results to date have tended to be constrained by low sample rates, noise, narrow domains of signal types, and/or lack of parameterized controls.
This work expands on prior research published on modeling nonlinear time-dependent signal processing effects.
arXiv Detail & Related papers (2020-06-10T00:52:08Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.