Music Mixing Style Transfer: A Contrastive Learning Approach to
Disentangle Audio Effects
- URL: http://arxiv.org/abs/2211.02247v3
- Date: Tue, 11 Apr 2023 07:53:14 GMT
- Title: Music Mixing Style Transfer: A Contrastive Learning Approach to
Disentangle Audio Effects
- Authors: Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan
Uhlich, Kyogu Lee, Yuki Mitsufuji
- Abstract summary: We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song.
This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording.
- Score: 23.29395422386749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an end-to-end music mixing style transfer system that converts the
mixing style of an input multitrack to that of a reference song. This is
achieved with an encoder pre-trained with a contrastive objective to extract
only audio-effects-related information from a reference music recording. All
our models are trained in a self-supervised manner from an already-processed
wet multitrack dataset, using an effective data preprocessing method that
alleviates the scarcity of unprocessed dry data. We analyze the
proposed encoder for the disentanglement capability of audio effects and also
validate its performance for mixing style transfer through both objective and
subjective evaluations. The results show that the proposed system not only
converts the mixing style of multitrack audio close to that of the reference,
but is also robust for mixture-wise style transfer when combined with a music
source separation model.
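The abstract describes the encoder's contrastive pre-training only at a high level. Below is a minimal, hypothetical PyTorch sketch of the idea, assuming an NT-Xent-style objective in which two different music segments processed with the same randomly sampled effects chain form a positive pair; the stand-in encoder, the `fake_fx` placeholder effect, and all hyperparameters are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ntxent_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss: z_a[i] and z_b[i] are embeddings of two *different*
    music segments processed with the *same* effects chain (positive pair);
    every other pairing in the batch is a negative. Minimizing this pushes
    the encoder to represent the effects rather than the musical content."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)  # positives on diagonal
    return F.cross_entropy(logits, labels)

# --- toy demonstration with a stand-in encoder and "effects chain" ---------
encoder = nn.Sequential(nn.Conv1d(1, 16, 64, stride=16), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 32))

def fake_fx(x: torch.Tensor, gain: torch.Tensor) -> torch.Tensor:
    return torch.tanh(gain * x)   # placeholder for a real chain (EQ, comp, ...)

seg_a, seg_b = torch.randn(8, 1, 4096), torch.randn(8, 1, 4096)
gain = torch.rand(8, 1, 1) * 4 + 0.5   # one sampled "chain" shared per pair
loss = ntxent_loss(encoder(fake_fx(seg_a, gain)), encoder(fake_fx(seg_b, gain)))
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```

Because the positive pairs share only the processing and never the content, content information gives the encoder no advantage on this objective, which is what drives the disentanglement of audio effects.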
Related papers
- Music Style Transfer With Diffusion Model [11.336043499372792]
This study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer.
The GuideDiff method restores spectrograms to high-fidelity audio, accelerating generation and reducing noise in the generated audio.
arXiv Detail & Related papers (2024-04-23T06:22:19Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z)
- High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Automatic music mixing with deep learning and out-of-domain data [10.670987762781834]
Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge.
We propose a novel data preprocessing method that allows the models to perform automatic music mixing.
We also redesign a listening test method for evaluating music mixing systems.
arXiv Detail & Related papers (2022-08-24T10:50:22Z)
- Improved singing voice separation with chromagram-based pitch-aware remixing [26.299721372221736]
We propose chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed together; a minimal sketch of the idea appears after this list.
We demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).
arXiv Detail & Related papers (2022-03-28T20:55:54Z)
- Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks [30.480360404811197]
A central task of a Disc Jockey (DJ) is to create a mixset of music with seamless transitions between adjacent tracks.
In this paper, we explore a data-driven approach that uses a generative adversarial network to create the song transition by learning from real-world DJ mixes.
arXiv Detail & Related papers (2021-10-13T06:25:52Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Differentiable Signal Processing With Black-Box Audio Effects [44.93154498647659]
We present a data-driven approach to automating audio signal processing by incorporating stateful, third-party audio effects as layers within a deep neural network; a sketch of one way to backpropagate through such black-box effects appears after this list.
We show that our approach can yield results comparable to a specialized, state-of-the-art commercial solution for music mastering.
arXiv Detail & Related papers (2021-05-11T02:20:22Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model trained for automatic speech recognition, together with extracted melody features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
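Two of the entries above describe mechanisms concrete enough to sketch. First, the chromagram-based pitch-aware remixing paper: below is a hypothetical illustration using librosa chroma features and a cosine-similarity threshold; the paper's exact alignment criterion and mixing scheme may differ, and `pitch_aware_remix` and its parameters are names introduced here for illustration.

```python
import numpy as np
import librosa

def chroma_similarity(y_a: np.ndarray, y_b: np.ndarray, sr: int = 22050) -> float:
    """Cosine similarity of time-averaged chromagrams: a rough proxy for
    how well the pitch content of two audio segments aligns."""
    c_a = librosa.feature.chroma_stft(y=y_a, sr=sr).mean(axis=1)
    c_b = librosa.feature.chroma_stft(y=y_b, sr=sr).mean(axis=1)
    return float(np.dot(c_a, c_b) / (np.linalg.norm(c_a) * np.linalg.norm(c_b) + 1e-8))

def pitch_aware_remix(vocal: np.ndarray, accomp_pool: list,
                      sr: int = 22050, threshold: float = 0.9) -> np.ndarray:
    """Create a new training mixture by pairing the vocal with an
    accompaniment from another song, restricted to candidates whose pitch
    content aligns with the vocal's; the vocal remains the target source."""
    aligned = [a for a in accomp_pool
               if chroma_similarity(vocal, a, sr) >= threshold]
    if not aligned:                    # no well-aligned candidate: skip remixing
        return vocal
    accomp = aligned[np.random.randint(len(aligned))]
    n = min(len(vocal), len(accomp))   # trim to a common length before summing
    return vocal[:n] + accomp[:n]
```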
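Second, the black-box audio effects paper: the summary does not say how gradients are obtained, but one standard way to train through a non-differentiable effect is a simultaneous-perturbation (SPSA-style) finite-difference estimate of the parameter gradient, sketched here with a toy clipper standing in for a third-party plugin. The class and function names, and the choice of SPSA, are assumptions for illustration rather than the paper's published implementation.

```python
import torch

class BlackBoxFx(torch.autograd.Function):
    """Wraps a non-differentiable audio effect as a network layer by
    estimating parameter gradients with SPSA (simultaneous-perturbation
    finite differences), so the effect's knobs can be learned end to end."""

    @staticmethod
    def forward(ctx, audio, params, fx, eps=1e-3):
        ctx.save_for_backward(audio, params)
        ctx.fx, ctx.eps = fx, eps
        with torch.no_grad():          # the effect itself is a black box
            return fx(audio, params)

    @staticmethod
    def backward(ctx, grad_out):
        audio, params = ctx.saved_tensors
        fx, eps = ctx.fx, ctx.eps
        delta = torch.sign(torch.randn_like(params))   # random +/-1 perturbation
        with torch.no_grad():          # two extra forward calls per step
            diff = fx(audio, params + eps * delta) - fx(audio, params - eps * delta)
        # chain rule: dL/dp_i ~= sum(grad_out * diff) / (2 * eps * delta_i)
        grad_params = (grad_out * diff).sum() / (2 * eps) * (1.0 / delta)
        return None, grad_params, None, None

# toy "third-party" effect: a hard clipper with a learnable threshold
def black_box_fx(x, p):
    return torch.clamp(x, -p[0], p[0])

x = torch.randn(4096)
p = torch.tensor([0.5], requires_grad=True)
y = BlackBoxFx.apply(x, p, black_box_fx)
loss = ((y - x) ** 2).mean()
loss.backward()
print(p.grad)                          # gradient estimated without autodiff
```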
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.