Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
- URL: http://arxiv.org/abs/2407.15641v1
- Date: Mon, 22 Jul 2024 13:59:58 GMT
- Title: Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
- Authors: Shahan Nercessian, Johannes Imort, Ninon Devis, Frederik Blang
- Abstract summary: We propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments.
Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding.
- Score: 2.3749120526936465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.
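To make the evaluation idea concrete, here is a minimal sketch of how an averaged CLAP score and a pairwise-consistency proxy could be computed. `embed_audio` and `embed_text` are hypothetical stand-ins for a pretrained CLAP model's embedding functions, and the paper's exact metric definitions may differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_clap_score(samples, prompt, embed_audio, embed_text) -> float:
    """Average text-audio similarity over all generated notes of one instrument.

    Scoring a single note against the prompt in isolation is the "naive
    application" the abstract flags as unsuitable; averaging over the whole
    88-key/velocity grid is one way to aggregate per instrument.
    """
    text_emb = embed_text(prompt)
    return float(np.mean([cosine_similarity(embed_audio(x), text_emb)
                          for x in samples]))

def timbral_consistency(samples, embed_audio) -> float:
    """Mean pairwise embedding similarity across notes of one instrument
    (a plausible consistency proxy; the paper's actual metric may differ)."""
    embs = [embed_audio(x) for x in samples]
    pairs = [cosine_similarity(embs[i], embs[j])
             for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return float(np.mean(pairs))
```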
Related papers
- Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation [3.8570045844185237]
We present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset.
Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems.
We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix.
arXiv Detail & Related papers (2024-08-05T14:34:40Z)
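As a rough illustration of Stem-JEPA's encoder/predictor idea, the following PyTorch sketch regresses predicted embeddings onto those of a compatible stem. The module names, the cosine objective, and the EMA target encoder are assumptions drawn from common JEPA practice, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def jepa_step(encoder, predictor, target_encoder, context_feats, stem_feats):
    """One training step: predict the embedding of a compatible stem.

    encoder/predictor/target_encoder are assumed nn.Modules; in JEPA-style
    setups the target encoder is often an EMA copy of the encoder.
    """
    ctx = encoder(context_feats)            # embed the available stems/mix
    pred = predictor(ctx)                   # predict the missing stem's embedding
    with torch.no_grad():
        tgt = target_encoder(stem_feats)    # embed the compatible target stem
    # Regress predicted embeddings onto targets (cosine regression here)
    return 1 - F.cosine_similarity(pred, tgt, dim=-1).mean()
```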
- InstrumentGen: Generating Sample-Based Musical Instruments From Text [3.4447129363520337]
We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts.
We propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding.
arXiv Detail & Related papers (2023-11-07T20:45:59Z)
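A hypothetical sketch of the conditioning interface InstrumentGen describes: learned embeddings for instrument family, source type, pitch, and velocity are combined with a projected text/audio embedding before conditioning the LM. Vocabulary sizes, dimensions, and the summation scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class InstrumentConditioner(nn.Module):
    """Builds a single conditioning vector from discrete and continuous inputs."""
    def __init__(self, dim=512, n_families=16, n_sources=4,
                 n_pitches=88, n_velocities=128):
        super().__init__()
        self.family = nn.Embedding(n_families, dim)
        self.source = nn.Embedding(n_sources, dim)
        self.pitch = nn.Embedding(n_pitches, dim)      # 88-key spectrum
        self.velocity = nn.Embedding(n_velocities, dim)
        self.proj = nn.Linear(dim, dim)                # maps the text/audio embedding

    def forward(self, family, source, pitch, velocity, clap_emb):
        # Sum the discrete conditions with the projected joint embedding
        return (self.family(family) + self.source(source) +
                self.pitch(pitch) + self.velocity(velocity) +
                self.proj(clap_emb))
```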
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of single-speaker singing voice synthesis.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline in both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representations, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
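The token-interleaving idea can be sketched with a MusicGen-style "delay" pattern, in which codebook k is shifted by k steps so all streams can be predicted in a single autoregressive pass. The pad id and array layout here are assumptions for illustration.

```python
import numpy as np

PAD = -1  # placeholder id for positions with no token yet

def delay_interleave(codes: np.ndarray) -> np.ndarray:
    """codes: (K, T) tokens from K codebooks -> (K, T + K - 1) delayed grid."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift stream k right by k steps
    return out

def delay_deinterleave(grid: np.ndarray) -> np.ndarray:
    """Invert the pattern back to (K, T)."""
    K = grid.shape[0]
    T = grid.shape[1] - (K - 1)
    return np.stack([grid[k, k:k + T] for k in range(K)])
```

The pattern round-trips: `delay_deinterleave(delay_interleave(codes))` recovers `codes`, which is what lets a single-stage LM model all K streams without a second refinement model.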
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
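A minimal sketch of a machine-ID contrastive objective: positives are other clips sharing a machine ID rather than augmented views of one clip. The temperature and normalization follow common supervised-contrastive practice and are not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def machine_id_contrastive_loss(embeddings, machine_ids, temperature=0.1):
    """embeddings: (N, D) projected features; machine_ids: (N,) int labels.

    Pulls together clips from the same machine ID so the representation
    clusters per machine rather than per individual audio sample.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = (machine_ids.unsqueeze(0) == machine_ids.unsqueeze(1)) & ~eye
    log_prob = sim.masked_fill(eye, float('-inf')).log_softmax(dim=1)
    log_prob = log_prob.masked_fill(eye, 0.0)  # avoid -inf * 0 = NaN
    per_anchor = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()       # only anchors with positives
```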
- An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms [2.3204178451683264]
In audio processing applications, there is strong demand for generating expressive sounds from high-level representations.
Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument compression.
This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch.
arXiv Detail & Related papers (2023-01-18T17:19:04Z)
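As a concrete example of this setup, here is a small stacked convolutional autoencoder over log-mel patches, trained with a reconstruction loss. Layer widths and the 128x128 patch shape are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpecAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Each stage halves the (mel, time) resolution; input: (B, 1, 128, 128)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Reconstruction objective: MSE between input and reconstructed log-mels
model = SpecAutoencoder()
x = torch.randn(4, 1, 128, 128)          # batch of log-mel patches
loss = nn.functional.mse_loss(model(x), x)
```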
- Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
arXiv Detail & Related papers (2022-08-26T02:54:57Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Time-Frequency Scattering Accurately Models Auditory Similarities Between Instrumental Playing Techniques [5.923588533979649]
We show that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone.
We propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques.
arXiv Detail & Related papers (2020-07-21T16:37:15Z)
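A sketch of comparing recordings via scattering features, assuming the kymatio package is available. Note the paper uses joint time-frequency scattering; the plain 1-D scattering used here is a stand-in that only partially captures the spectrotemporal modulations involved.

```python
import numpy as np
from kymatio.numpy import Scattering1D

def scattering_distance(x1, x2, J=8, Q=8):
    """Euclidean distance between time-averaged scattering features.

    x1, x2: 1-D audio arrays of equal length T (with T >= 2**J).
    """
    T = len(x1)
    scattering = Scattering1D(J=J, shape=T, Q=Q)
    f1 = scattering(x1).mean(axis=-1)  # average coefficients over time
    f2 = scattering(x2).mean(axis=-1)
    return float(np.linalg.norm(f1 - f2))

# A similarity graph over recordings can then be built by thresholding
# pairwise distances and clustering the resulting adjacency structure.
```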
- RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning [69.20460466735852]
This paper presents a deep reinforcement learning algorithm for online accompaniment generation.
The proposed algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part.
arXiv Detail & Related papers (2020-02-08T03:53:52Z)