Actions Speak Louder than Listening: Evaluating Music Style Transfer
based on Editing Experience
- URL: http://arxiv.org/abs/2110.12855v1
- Date: Mon, 25 Oct 2021 12:20:30 GMT
- Title: Actions Speak Louder than Listening: Evaluating Music Style Transfer
based on Editing Experience
- Authors: Wei-Tsung Lu, Meng-Hsuan Wu, Yuh-Ming Chiu, Li Su
- Abstract summary: We propose an editing test to evaluate users' editing experience of music generation models in a systematic way.
Results on two target styles indicate that the improvement over the baseline model can be reflected by the editing test quantitatively.
- Score: 4.986422167919228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The subjective evaluation of music generation techniques has been mostly done
with questionnaire-based listening tests, largely ignoring the perspectives of
music composition, arrangement, and soundtrack editing. In this paper, we
propose an editing test to evaluate users' editing experience of music
generation models in a systematic way. To do this, we design a new music style
transfer model combining the non-chronological inference architecture,
autoregressive models, and the Transformer, which serves as an improvement over
the baseline model on the same style transfer task. Then, we compare the
performance of the two models with a conventional listening test and the
proposed editing test, in which the quality of generated samples is assessed by
the amount of effort (e.g., the number of required keyboard and mouse actions)
spent by users to polish a music clip. Results on two target styles indicate
that the improvement over the baseline model can be reflected by the editing
test quantitatively. The editing test also provides insights that are not
accessible from conventional listening tests. The major contribution of this
paper is the systematic presentation of the editing test and the corresponding
insights; the proposed music style transfer model based on state-of-the-art
neural networks is a further contribution.
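A minimal sketch of how the effort measured by such an editing test could be tallied is given below; the logged action names, the session schema, and the equal weighting of all actions are illustrative assumptions, not the authors' exact protocol.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EditingSession:
    """One user's log while polishing a single generated clip (hypothetical schema)."""
    clip_id: str
    actions: List[str]  # e.g. "key_press", "mouse_click", "note_delete", "note_move"

def editing_effort(session: EditingSession) -> Dict[str, object]:
    """Summarise a session as raw action counts plus a total effort score.

    The paper measures effort by the number of keyboard and mouse actions;
    here every logged action simply counts as one unit (no weighting assumed).
    """
    counts = Counter(session.actions)
    return {"clip_id": session.clip_id,
            "counts": dict(counts),
            "total_effort": sum(counts.values())}

# Usage: compare two models by the mean effort their outputs require.
sessions_a = [EditingSession("a1", ["key_press"] * 12 + ["mouse_click"] * 7)]
sessions_b = [EditingSession("b1", ["key_press"] * 30 + ["mouse_click"] * 18)]
mean_a = sum(editing_effort(s)["total_effort"] for s in sessions_a) / len(sessions_a)
mean_b = sum(editing_effort(s)["total_effort"] for s in sessions_b) / len(sessions_b)
print(f"model A mean effort: {mean_a:.1f}  model B mean effort: {mean_b:.1f}")
```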
Related papers
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions.
Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
- Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription [13.960714900433269]
Sheet Music Transformer is the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies.
Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively.
arXiv Detail & Related papers (2024-02-12T11:52:21Z)
- StemGen: A music generation model that listens [9.489938613869864]
We present an alternative paradigm for producing music generation models that can listen and respond to musical context.
We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture.
The resulting model reaches the audio quality of state-of-the-art text-conditioned models while exhibiting strong musical coherence with its context.
arXiv Detail & Related papers (2023-12-14T08:09:20Z)
- Investigating Personalization Methods in Text to Music Generation [21.71190700761388]
Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods.
For evaluation, we construct a novel dataset with prompts and music clips.
Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody.
arXiv Detail & Related papers (2023-09-20T08:36:34Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns (see the interleaving sketch after this list).
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity (see the edit-distance sketch after this list).
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave out one feature at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
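The token interleaving mentioned in the MusicGen entry above is not detailed on this page; below is a minimal sketch of one common delay-style interleaving of K parallel codebook streams, written as an illustration rather than as MusicGen's exact pattern (the PAD value and stream layout are assumptions).

```python
from typing import List

PAD = -1  # placeholder id for positions a delayed stream has not reached yet (assumption)

def delay_interleave(codebooks: List[List[int]]) -> List[List[int]]:
    """Shift codebook stream k right by k steps so one transformer step can
    emit K tokens in parallel while later codebooks still see earlier ones."""
    k_streams, t_steps = len(codebooks), len(codebooks[0])
    total = t_steps + k_streams - 1               # length after the largest shift
    out = [[PAD] * total for _ in range(k_streams)]
    for k, stream in enumerate(codebooks):
        for t, token in enumerate(stream):
            out[k][t + k] = token                 # stream k delayed by k positions
    return out

# Usage: 4 toy codebook streams of 6 time steps each.
toy = [[10 * k + t for t in range(6)] for k in range(4)]
for row in delay_interleave(toy):
    print(row)
```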
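Similarly, the edit distance similarity cited in the "Exploring the Efficacy of Pre-trained Checkpoints" entry is not defined here; a minimal sketch of one plausible formulation (Levenshtein distance over token sequences, normalised by the longer sequence) follows. The note tokens in the usage example are invented for illustration.

```python
from typing import List, Sequence

def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[len(b)]

def edit_distance_similarity(a: Sequence, b: Sequence) -> float:
    """Map the distance to a similarity in [0, 1], where 1 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Usage on toy note-token sequences.
reference: List[str] = ["C4", "E4", "G4", "C5"]
generated: List[str] = ["C4", "E4", "A4", "C5", "E5"]
print(edit_distance_similarity(reference, generated))  # 0.6
```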
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.