Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space
- URL: http://arxiv.org/abs/2510.04339v1
- Date: Sun, 05 Oct 2025 20:03:30 GMT
- Title: Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space
- Authors: Christian Limberg, Fares Schulz, Zhe Zhang, Stefan Weinzierl
- Abstract summary: This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework. We train a pitch-timbre disentangled 2D representation of audio samples using a Variational Autoencoder. We then use this representation as conditioning input for a Transformer-based generative model.
- Score: 6.12877670327196
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework capable of generating pitch-accurate, high-quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer-based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering: https://pgesam.faresschulz.com
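The two-stage design described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' implementation: a VAE that compresses per-sample audio features into a 2D timbre latent, and an autoregressive Transformer over discrete audio tokens conditioned on that latent plus a MIDI pitch. All module names, layer sizes, and the token-based audio representation are illustrative assumptions; the training losses (reconstruction, KL, cross-entropy) are omitted.

```python
import torch
import torch.nn as nn

class TimbreVAE(nn.Module):
    """Stage 1: compress per-sample audio features into a 2D timbre latent."""
    def __init__(self, feat_dim=512, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(z), mu, logvar


class PitchConditionedTransformer(nn.Module):
    """Stage 2: autoregressive Transformer over discrete audio tokens,
    conditioned on the 2D timbre latent and a MIDI pitch."""
    def __init__(self, vocab=1024, d_model=256, n_pitches=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)
        self.timbre_proj = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, timbre_xy, pitch):
        cond = self.timbre_proj(timbre_xy) + self.pitch_emb(pitch)  # (B, d_model)
        x = self.token_emb(tokens) + cond.unsqueeze(1)              # broadcast conditioning over time
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=mask))               # next-token logits


# Example: pick a point in the 2D timbre plane and request MIDI pitch 60 (C4)
timbre_xy = torch.tensor([[0.3, -1.2]])
pitch = torch.tensor([60])
tokens = torch.zeros(1, 16, dtype=torch.long)                        # seed token sequence
logits = PitchConditionedTransformer()(tokens, timbre_xy, pitch)
```

In this reading, navigating the 2D plane in the interactive web application amounts to varying `timbre_xy` while holding the pitch conditioning fixed, which is what makes the latent space usable as a direct interface for timbre exploration.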
Related papers
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders [4.757470067755357]
We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties. This enables both controllable manipulation and analysis of the AI music generation process.
arXiv Detail & Related papers (2025-10-27T19:35:39Z)
- SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
- FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment [11.796771978828403]
We introduce FolAI, a two-stage generative framework that produces temporally coherent and semantically controllable sound effects from video. Results show that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings.
arXiv Detail & Related papers (2024-12-19T16:37:19Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation. Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z) - Controllable Music Production with Diffusion Models and Guidance
Gradients [3.187381965457262]
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in 44.1kHz stereo audio.
The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips.
arXiv Detail & Related papers (2023-11-01T16:01:01Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z) - Flat latent manifolds for music improvisation between human and machine [9.571383193449648]
We consider a music-generating algorithm as a counterpart to a human musician, in a setting where reciprocal improvisation is to lead to new experiences.
In the learned model, we generate novel musical sequences by quantification in latent space.
We provide empirical evidence for our method via a set of experiments on music and we deploy our model for an interactive jam session with a professional drummer.
arXiv Detail & Related papers (2022-02-23T09:00:17Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model, trained for the task of automatic speech recognition, together with melody-extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders.
New possibilities for timbre manipulations are enabled with generative neural networks.
arXiv Detail & Related papers (2020-08-04T07:08:04Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)