Neutone SDK: An Open Source Framework for Neural Audio Processing
- URL: http://arxiv.org/abs/2508.09126v1
- Date: Tue, 12 Aug 2025 17:55:08 GMT
- Title: Neutone SDK: An Open Source Framework for Neural Audio Processing
- Authors: Christopher Mitcheltree, Bogdan Teleaga, Andrew Fyfe, Naotake Masuda, Matthias Schäfer, Alfie Bradic, Nao Tokui,
- Abstract summary: We introduce the Neutone SDK: an open source framework that streamlines the deployment of PyTorch-based neural audio models. We provide a technical overview of the interfaces needed to accomplish this, as well as the corresponding SDK implementations. We also demonstrate the SDK's versatility across applications such as audio effect emulation, timbre transfer, and sample generation.
- Score: 0.8062120534124608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio processing has unlocked novel methods of sound transformation and synthesis, yet integrating deep learning models into digital audio workstations (DAWs) remains challenging due to real-time constraints on neural network inference and the complexities of plugin development. In this paper, we introduce the Neutone SDK: an open source framework that streamlines the deployment of PyTorch-based neural audio models for both real-time and offline applications. By encapsulating common challenges such as variable buffer sizes, sample rate conversion, delay compensation, and control parameter handling within a unified, model-agnostic interface, our framework enables seamless interoperability between neural models and host plugins while allowing users to work entirely in Python. We provide a technical overview of the interfaces needed to accomplish this, as well as the corresponding SDK implementations. We also demonstrate the SDK's versatility across applications such as audio effect emulation, timbre transfer, and sample generation, as well as its adoption by researchers, educators, companies, and artists alike. The Neutone SDK is available at https://github.com/Neutone/neutone_sdk
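To make the abstract's description concrete, the sketch below shows the kind of buffering and parameter bookkeeping a model-agnostic wrapper of this sort has to perform: the host delivers buffers of arbitrary size while the wrapped PyTorch model expects fixed-size blocks, so input and output FIFOs absorb the mismatch and the wrapper reports a constant one-block latency for delay compensation. This is a minimal illustration only; the class and parameter names (StreamingWrapper, gain) are hypothetical and are not the Neutone SDK's actual interface, and the sketch omits sample rate conversion and TorchScript export, which the released SDK also covers.

```python
import torch
import torch.nn as nn


class StreamingWrapper(nn.Module):
    """Hypothetical sketch of a model-agnostic streaming wrapper (not the
    actual Neutone SDK API). The wrapped model sees fixed-size blocks while
    the host may deliver buffers of arbitrary length; FIFOs hide the
    mismatch and the latency stays constant at one model block."""

    def __init__(self, model: nn.Module, model_block_size: int = 2048):
        super().__init__()
        self.model = model
        self.block_size = model_block_size
        self.in_fifo = torch.zeros(1, 0)                  # samples awaiting processing
        self.out_fifo = torch.zeros(1, model_block_size)  # initial delay-compensation padding

    def forward(self, x: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
        """Process one host buffer of shape (1, n) for any n; returns (1, n)."""
        n = x.shape[-1]
        self.in_fifo = torch.cat([self.in_fifo, x], dim=-1)
        # Run the wrapped model whenever a full block has accumulated.
        while self.in_fifo.shape[-1] >= self.block_size:
            block = self.in_fifo[:, : self.block_size]
            self.in_fifo = self.in_fifo[:, self.block_size :]
            with torch.no_grad():
                y = self.model(block) * gain              # control parameter applied per block
            self.out_fifo = torch.cat([self.out_fifo, y], dim=-1)
        # Return exactly as many samples as the host handed in.
        out = self.out_fifo[:, :n]
        self.out_fifo = self.out_fifo[:, n:]
        return out


# Usage: wrap any (1, N) -> (1, N) PyTorch model and feed it variable-size buffers.
effect = nn.Identity()  # stand-in for a neural audio effect
wrapper = StreamingWrapper(effect, model_block_size=2048)
for buf_size in (512, 480, 1024):  # buffer sizes a DAW host might use
    audio_out = wrapper(torch.randn(1, buf_size), gain=0.5)
    assert audio_out.shape == (1, buf_size)
```

Keeping the reported latency fixed at one model block regardless of the host buffer size is what allows a plugin host to apply standard delay compensation to the wrapped model.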
Related papers
- MOVA: Towards Scalable and Synchronized Video-Audio Generation [91.56945636522345]
We introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators.
arXiv Detail & Related papers (2026-02-09T15:31:54Z) - Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - POET: Prompt Offset Tuning for Continual Human Action Adaptation [61.63831623094721]
We aim to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. We formalize this as privacy-aware few-shot continual action recognition. We propose a novel spatio-temporal learnable prompt tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks.
arXiv Detail & Related papers (2025-04-25T04:11:24Z) - Designing Neural Synthesizers for Low-Latency Interaction [8.27756937768806]
We investigate the sources of latency and jitter typically found in interactive Neural Audio Synthesis (NAS) models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder. This culminates with a model we call BRAVE, which is low-latency and exhibits better pitch and loudness replication.
arXiv Detail & Related papers (2025-03-14T16:30:31Z) - NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals [58.83169560132308]
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks.
arXiv Detail & Related papers (2024-07-18T17:59:01Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method for producing non-causal streaming models.
This makes any convolutional model compatible with real-time buffer-based processing.
We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z) - Real-time Timbre Transfer and Sound Synthesis using DDSP [1.7942265700058984]
We present a real-time implementation of the Magenta DDSP library embedded in a virtual synthesizer as a plug-in.
We focused on timbre transfer from learned representations of real instruments to arbitrary sound inputs as well as controlling these models by MIDI.
We developed a GUI for intuitive high-level controls which can be used for post-processing and manipulating the parameters estimated by the neural network.
arXiv Detail & Related papers (2021-03-12T11:49:51Z) - MTCRNN: A multi-scale RNN for directed audio texture synthesis [0.0]
We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis.
We demonstrate the model's performance on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.
arXiv Detail & Related papers (2020-11-25T09:13:53Z) - Neural Network Compression Framework for fast model inference [59.65531492759006]
We present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF).
It leverages recent advances in various network compression methods and implements some of them, such as sparsity, quantization, and binarization.
The framework can be used within the training samples, which are supplied with it, or as a standalone package that can be seamlessly integrated into the existing training code.
arXiv Detail & Related papers (2020-02-20T11:24:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.