HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW
- URL: http://arxiv.org/abs/2503.02977v1
- Date: Tue, 04 Mar 2025 20:01:40 GMT
- Title: HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW
- Authors: Christodoulos Benetatos, Frank Cwitkowitz, Nathan Pruyne, Hugo Flores Garcia, Patrick O'Reilly, Zhiyao Duan, Bryan Pardo
- Abstract summary: HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing. Users can route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations.
- Score: 18.32614229984195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.
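The routing model described in the abstract is concrete enough to sketch: HARP talks to a hosted Gradio endpoint that accepts audio plus some controls and returns processed audio. Below is a minimal, illustrative endpoint written with plain gradio; the gain transform and slider are hypothetical stand-ins for a deep learning model, and the paper's pyharp helpers (which wrap endpoints like this with model metadata and HARP-specific controls) are not shown because their exact API is not reproduced here.

```python
# Minimal sketch of the kind of Gradio endpoint HARP can route audio through.
# The "model" is a toy gain change; replace it with any audio transformation.
import numpy as np
import gradio as gr

def process(audio, gain_db):
    """Toy 'model': apply a gain change to the incoming audio."""
    sample_rate, samples = audio                       # gr.Audio(type="numpy") passes (rate, ndarray)
    if np.issubdtype(samples.dtype, np.integer):       # browser audio usually arrives as int16
        samples = samples.astype(np.float32) / np.iinfo(samples.dtype).max
    else:
        samples = samples.astype(np.float32)
    samples = np.clip(samples * 10.0 ** (gain_db / 20.0), -1.0, 1.0)  # dB -> linear, avoid clipping
    return sample_rate, samples

demo = gr.Interface(
    fn=process,
    inputs=[gr.Audio(type="numpy"), gr.Slider(-24.0, 24.0, value=0.0, label="Gain (dB)")],
    outputs=gr.Audio(type="numpy"),
)

if __name__ == "__main__":
    demo.launch()  # host the endpoint; a HARP-style client routes audio to it remotely
```

A HARP-style client would send audio to such a hosted endpoint, wait for the asynchronous job to finish, and pull the processed result back into the DAW.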
Related papers
- Designing Neural Synthesizers for Low-Latency Interaction [8.27756937768806]
We investigate the sources of latency and jitter typically found in interactive Neural Audio Synthesis (NAS) models.
We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder.
This culminates with a model we call BRAVE, which is low-latency and exhibits better pitch and loudness replication.
arXiv Detail & Related papers (2025-03-14T16:30:31Z)
- AudioX: Diffusion Transformer for Anything-to-Audio Generation [72.84633243365093]
AudioX is a unified Diffusion Transformer model for Anything-to-Audio and Music Generation.
It can generate both general audio and music with high quality, while offering flexible natural language control.
To address data scarcity, we curate two datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset.
arXiv Detail & Related papers (2025-03-13T16:30:59Z)
- InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation [43.690876909464336]
We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation.
The unified framework generates high-fidelity music, songs, and audio by combining an autoregressive transformer with a super-resolution flow-matching model.
Our model differs from previous approaches in that it utilizes an audio tokenizer with a single codebook containing richer semantic information.
arXiv Detail & Related papers (2025-02-28T09:58:25Z)
- Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
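The genre-classification entry above outlines a chunk-then-aggregate pipeline: split the signal into 20 ms chunks, classify each chunk, and combine the per-chunk predictions. A toy sketch of that inference loop follows; the label set, the placeholder chunk classifier, and majority-vote aggregation are assumptions standing in for the paper's convolutional feature encoders and zero-shot LLM.

```python
# Toy chunk-and-aggregate genre inference: 20 ms chunks, per-chunk prediction, majority vote.
import numpy as np

GENRES = ["rock", "jazz", "classical", "hip-hop"]    # illustrative label set

def classify_chunk(chunk: np.ndarray) -> int:
    """Placeholder for the convolutional feature encoder + zero-shot LLM classifier."""
    return int(np.abs(chunk).mean() * 1e3) % len(GENRES)

def classify_track(audio: np.ndarray, sample_rate: int) -> str:
    chunk_len = int(0.020 * sample_rate)             # 20 ms of samples per chunk
    n_chunks = len(audio) // chunk_len
    votes = [classify_chunk(audio[i * chunk_len:(i + 1) * chunk_len]) for i in range(n_chunks)]
    return GENRES[np.bincount(votes, minlength=len(GENRES)).argmax()]  # aggregate chunk predictions

print(classify_track(np.random.randn(16000 * 5), 16000))  # 5 s of fake audio at 16 kHz
```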
- Multi-Source Music Generation with Latent Diffusion [7.832209959041259]
The Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources.
MSLDM employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation.
This approach significantly enhances the total and partial generation of music.
arXiv Detail & Related papers (2024-09-10T03:41:10Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Foundational GPT Model for MEG [3.524869467682149]
We propose two classes of deep learning foundational models that can be trained using forecasting of unlabelled brain signals.
First, we consider a modified Wavenet; second, a modified Transformer-based (GPT2) model.
We compare the performance of these deep learning models with standard linear autoregressive (AR) modelling on MEG data.
arXiv Detail & Related papers (2024-04-14T13:48:24Z)
- ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models [51.35570730554632]
ESPnet-SPK is a toolkit for training speaker embedding extractors.
We provide several models, ranging from x-vector to recent SKA-TDNN.
We also aspire to bridge developed models with other domains.
arXiv Detail & Related papers (2024-01-30T18:18:27Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model generalises and performs well on 14 music understanding tasks, attaining state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
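The MERT entry above describes masked-language-modelling-style acoustic pre-training in which teacher models supply pseudo labels for masked frames. Below is a toy sketch of that objective in PyTorch; the linear encoder, two-layer transformer, random features, and random pseudo labels are placeholders rather than MERT's actual architecture or teachers.

```python
# Toy MLM-style acoustic pre-training: mask frames, predict the teacher's pseudo-label ids.
import torch
import torch.nn as nn

class ToyAcousticMLM(nn.Module):
    def __init__(self, n_mels=80, dim=256, n_pseudo_labels=500):
        super().__init__()
        self.encoder = nn.Linear(n_mels, dim)               # stands in for a CNN feature encoder
        self.mask_embedding = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_pseudo_labels)          # predicts the teacher's pseudo-label ids

    def forward(self, features, mask):
        x = self.encoder(features)                                    # (batch, frames, dim)
        x = torch.where(mask.unsqueeze(-1), self.mask_embedding, x)   # replace masked frames
        return self.head(self.transformer(x))

model = ToyAcousticMLM()
features = torch.randn(2, 100, 80)                   # fake mel-spectrogram frames
pseudo_labels = torch.randint(0, 500, (2, 100))      # "teacher" labels, e.g. from clustering
mask = torch.rand(2, 100) < 0.3                      # mask roughly 30% of frames
logits = model(features, mask)
loss = nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])  # loss only on masked frames
loss.backward()
```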
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Real-time Timbre Transfer and Sound Synthesis using DDSP [1.7942265700058984]
We present a real-time implementation of the Magenta DDSP library embedded in a virtual synthesizer as a plug-in.
We focused on timbre transfer from learned representations of real instruments to arbitrary sound inputs as well as controlling these models by MIDI.
We developed a GUI for intuitive high-level controls which can be used for post-processing and manipulating the parameters estimated by the neural network.
arXiv Detail & Related papers (2021-03-12T11:49:51Z)
- Towards democratizing music production with AI-Design of Variational Autoencoder-based Rhythm Generator as a DAW plugin [0.0]
This paper proposes a rhythm generation system based on the Variational Autoencoder (VAE) [Kingma, 2014].
Musicians can train a deep learning model simply by selecting target MIDI files, then generate various rhythms with the model.
arXiv Detail & Related papers (2020-04-01T10:50:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.