A Unified Model for Zero-shot Music Source Separation, Transcription and
Synthesis
- URL: http://arxiv.org/abs/2108.03456v1
- Date: Sat, 7 Aug 2021 14:28:21 GMT
- Title: A Unified Model for Zero-shot Music Source Separation, Transcription and
Synthesis
- Authors: Liwei Lin, Qiuqiang Kong, Junyan Jiang and Gus Xia
- Abstract summary: We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also at the same time perceive high-level representations such as score and timbre.
- Score: 13.263771543118994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a unified model for three inter-related tasks: 1) to
\textit{separate} individual sound sources from a mixed music audio, 2) to
\textit{transcribe} each sound source to MIDI notes, and 3) to
\textit{synthesize} new pieces based on the timbre of separated sources. The model is
inspired by the fact that when humans listen to music, our minds can not only
separate the sounds of different instruments, but also at the same time
perceive high-level representations such as score and timbre. To mirror such
capability computationally, we designed a pitch-timbre disentanglement module
based on a popular encoder-decoder neural architecture for source separation.
The key inductive biases are vector-quantization for pitch representation and
pitch-transformation invariance for timbre representation. In addition, we
adopted a query-by-example method to achieve \textit{zero-shot} learning, i.e.,
the model is capable of doing source separation, transcription, and synthesis
for \textit{unseen} instruments. The current design focuses on audio mixtures
of two monophonic instruments. Experimental results show that our model
outperforms existing multi-task baselines, and the transcribed score serves as
a powerful auxiliary for separation tasks.
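The abstract states the two inductive biases only at a high level. As a rough, non-authoritative illustration (not the authors' implementation), the PyTorch sketch below shows one way a vector-quantized pitch bottleneck and a pitch-transformation-invariant, query-by-example timbre encoder could be wired together; all module choices, dimensions, and loss terms here are assumptions.

# Illustrative sketch only -- not the authors' implementation. Module choices,
# dimensions, and loss terms are assumptions made for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Straight-through VQ bottleneck standing in for the discrete pitch code."""

    def __init__(self, num_codes=128, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (batch, time, dim)
        # Squared distance from each frame vector to every codebook entry.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        tokens = dist.argmin(dim=-1)                       # discrete pitch tokens
        quantized = self.codebook(tokens)
        commit = F.mse_loss(z, quantized.detach()) + F.mse_loss(quantized, z.detach())
        quantized = z + (quantized - z).detach()           # straight-through gradient
        return quantized, tokens, commit


class PitchTimbreDisentangler(nn.Module):
    """Toy encoder-decoder with a VQ pitch branch and a query-based timbre branch."""

    def __init__(self, n_mels=128, dim=64):
        super().__init__()
        self.pitch_enc = nn.GRU(n_mels, dim, batch_first=True)
        self.timbre_enc = nn.GRU(n_mels, dim, batch_first=True)
        self.vq = VectorQuantizer(dim=dim)
        self.dec = nn.GRU(2 * dim, n_mels, batch_first=True)

    def encode_timbre(self, query_spec):
        _, h = self.timbre_enc(query_spec)                 # summarize the query example
        return h[-1]                                       # (batch, dim) timbre code

    def forward(self, mix_spec, query_spec):
        pitch_feat, _ = self.pitch_enc(mix_spec)
        pitch_q, pitch_tokens, vq_loss = self.vq(pitch_feat)
        timbre = self.encode_timbre(query_spec)            # query-by-example conditioning
        timbre = timbre.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        separated, _ = self.dec(torch.cat([pitch_q, timbre], dim=-1))
        return separated, pitch_tokens, vq_loss


def timbre_invariance_loss(model, query_spec, pitch_shifted_query_spec):
    """Pitch-transformation invariance: the timbre code of a query and of a
    pitch-shifted copy of the same query should agree."""
    return F.mse_loss(model.encode_timbre(query_spec),
                      model.encode_timbre(pitch_shifted_query_spec))

In this sketch, the discrete pitch tokens play the role of a transcription-like representation, and the timbre embedding extracted from a query recording is what would let the model condition on unseen instruments at test time.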
Related papers
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
arXiv Detail & Related papers (2023-09-27T15:19:05Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter; a skeletal sketch of this kind of pipeline appears after this list.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by the left and right channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experimental results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model, trained for the task of automatic speech recognition, together with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
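The two-stage design mentioned in the Multi-instrument Music Synthesis entry above (MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a GAN spectrogram inverter) can be outlined as a simple data-flow skeleton. The PyTorch sketch below only illustrates that pipeline shape and is not the paper's model; the tokenization, dimensions, and the toy linear stand-in for the GAN inverter are all assumptions.

# Data-flow skeleton only -- not the Spectrogram Diffusion implementation.
# Vocabulary size, dimensions, and the toy linear "inverter" are assumptions.
import torch
import torch.nn as nn


class MidiToSpectrogram(nn.Module):
    """Stage 1: encoder-decoder Transformer mapping MIDI tokens to mel frames."""

    def __init__(self, vocab_size=512, d_model=256, n_mels=128, max_frames=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.frame_queries = nn.Parameter(torch.randn(max_frames, d_model))
        self.transformer = nn.Transformer(d_model=d_model, num_encoder_layers=2,
                                          num_decoder_layers=2, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, midi_tokens):                        # (batch, seq)
        src = self.embed(midi_tokens)                      # (batch, seq, d_model)
        tgt = self.frame_queries.unsqueeze(0).expand(midi_tokens.size(0), -1, -1)
        frames = self.transformer(src, tgt)                # (batch, frames, d_model)
        return self.to_mel(frames)                         # (batch, frames, n_mels)


class SpectrogramInverter(nn.Module):
    """Stage 2 stand-in: map mel frames to waveform samples. A GAN vocoder
    generator would take this role in practice."""

    def __init__(self, n_mels=128, hop=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, hop), nn.Tanh())

    def forward(self, mel):                                # (batch, frames, n_mels)
        return self.net(mel).flatten(1)                    # (batch, frames * hop)


if __name__ == "__main__":
    midi = torch.randint(0, 512, (1, 64))                  # placeholder MIDI token sequence
    mel = MidiToSpectrogram()(midi)
    audio = SpectrogramInverter()(mel)
    print(mel.shape, audio.shape)                          # (1, 256, 128) and (1, 65536)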