MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
- URL: http://arxiv.org/abs/2407.03188v2
- Date: Thu, 11 Jul 2024 03:32:44 GMT
- Title: MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
- Authors: Zihao Wang, Haoxuan Liu, Jiaxing Yu, Tao Zhang, Yan Liu, Kejun Zhang
- Abstract summary: We propose a novel task of Colloquial Description-to-Song Generation.
It focuses on aligning the generated content with colloquial human expressions.
This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model.
- Score: 18.181382408551574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations.
Related papers
- SongCreator: Lyrics-based Universal Song Generation [53.248473603201916]
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics.
The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention-mask strategies for the DSLM.
Experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.
arXiv Detail & Related papers (2024-09-09T19:37:07Z)
- Controllable Lyrics-to-Melody Generation [14.15838552524433]
We propose a controllable lyrics-to-melody generation network, ConL2M, which is able to generate realistic melodies from lyrics in user-desired musical style.
Our work contains three main novelties: 1) to model the dependencies of music attributes across multiple sequences, inter-branch memory fusion (Memofu) is proposed to enable information flow between the branches of a multi-branch stacked LSTM architecture; 2) reference style embedding (RSE) is proposed to improve the quality of generation and to control the musical style of generated melodies; 3) sequence-level statistical loss (SeqLoss) is proposed to help the model learn sequence-level statistical features of melodies.
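The sequence-level statistical loss idea can be sketched minimally: instead of comparing tokens position by position, penalize mismatches in whole-sequence statistics between generated and reference attribute sequences. This is an illustrative stand-in, not the paper's actual SeqLoss formulation, and the function name is an assumption:

```python
from statistics import mean, pvariance

def seq_loss(generated, reference):
    """Illustrative sequence-level statistical loss: penalize mismatch
    between the mean and variance of a generated attribute sequence
    (e.g., MIDI pitches) and those of a reference sequence."""
    diff_mean = mean(generated) - mean(reference)
    diff_var = pvariance(generated) - pvariance(reference)
    return diff_mean ** 2 + diff_var ** 2
```

A loss of this shape is zero when the two sequences share the same summary statistics, even if their tokens differ, which is the sense in which it is "sequence-level" rather than token-level.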
arXiv Detail & Related papers (2023-06-05T06:14:08Z)
- Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
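Compiling a melody into decoding constraints can be illustrated with a toy syllable-count filter: each melodic phrase imposes a syllable budget that candidate lyric lines must meet. The function names and the vowel-group heuristic below are assumptions for illustration, not the paper's actual segmentation and rhythm-alignment method:

```python
import re

def count_syllables(line: str) -> int:
    # Crude heuristic: count vowel groups per word, at least one per word.
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
               for w in line.split())

def compile_constraints(phrase_note_counts):
    # One constraint per melodic phrase: one syllable per note.
    return [{"syllables": n} for n in phrase_note_counts]

def constrained_pick(candidates, constraint):
    # Keep only candidate lyric lines that satisfy the constraint;
    # a real decoder would apply such a filter at each decoding step.
    return [c for c in candidates
            if count_syllables(c) == constraint["syllables"]]
```

For example, a three-note phrase yields the constraint `{"syllables": 3}`, which admits "hello world" but rejects "hi".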
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
- Unsupervised Melody-Guided Lyrics Generation [84.22469652275714]
We propose to generate pleasantly listenable lyrics without training on melody-lyric aligned data.
We leverage the crucial alignments between melody and lyrics and compile the given melody into constraints to guide the generation process.
arXiv Detail & Related papers (2023-05-12T20:57:20Z)
- Re-creation of Creations: A New Paradigm for Lyric-to-Melody Generation [158.54649047794794]
Re-creation of Creations (ROC) is a new paradigm for lyric-to-melody generation.
ROC achieves good lyric-melody feature alignment in lyric-to-melody generation.
arXiv Detail & Related papers (2022-08-11T08:44:47Z)
- Interpretable Melody Generation from Lyrics with Discrete-Valued Adversarial Training [12.02541352832997]
Gumbel-Softmax is exploited to solve the non-differentiability problem of generating discrete music attributes with Generative Adversarial Networks (GANs).
Users can listen to the generated AI song as well as recreate a new song by selecting from recommended music attributes.
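The Gumbel-Softmax trick referenced above can be sketched in a few lines: perturb the generator's logits with Gumbel noise, then apply a temperature-scaled softmax, yielding a nearly one-hot but still differentiable sample. This is a generic illustration of the technique, not the paper's implementation:

```python
import math
import random

def gumbel_softmax_sample(logits, tau=1.0):
    """Gumbel-Softmax sample over discrete attributes (e.g., pitch classes):
    the relaxed, nearly one-hot output lets gradients flow to a GAN
    generator, sidestepping non-differentiable discrete sampling."""
    noisy = []
    for logit in logits:
        u = max(random.random(), 1e-12)       # avoid log(0)
        gumbel = -math.log(-math.log(u))      # standard Gumbel noise
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)                            # stabilize the softmax
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

As `tau` approaches zero the sample approaches a hard one-hot vector; larger values give smoother, more uniform relaxations.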
arXiv Detail & Related papers (2022-06-30T05:45:47Z)
- Flat latent manifolds for music improvisation between human and machine [9.571383193449648]
We consider a music-generating algorithm as a counterpart to a human musician, in a setting where reciprocal improvisation is to lead to new experiences.
In the learned model, we generate novel musical sequences by quantification in latent space.
We provide empirical evidence for our method via a set of experiments on music and we deploy our model for an interactive jam session with a professional drummer.
arXiv Detail & Related papers (2022-02-23T09:00:17Z)
- Music Harmony Generation, through Deep Learning and Using a Multi-Objective Evolutionary Algorithm [0.0]
This paper introduces a genetic multi-objective evolutionary optimization algorithm for the generation of polyphonic music.
One objective encodes the rules and grammar of music; together with the other two objectives, the scores of music experts and of ordinary listeners, it drives the evolutionary cycle toward the most optimal response.
The results show that the proposed method is able to generate difficult and pleasant pieces of the desired style and length, with harmonies that follow the grammar of music while remaining attractive to the listener.
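The selection step of such a multi-objective evolutionary algorithm can be illustrated with a standard Pareto-dominance check over the three objectives (rule compliance, expert score, listener score). This is a generic sketch of the technique, not the paper's exact algorithm:

```python
def dominates(a, b):
    """Pareto dominance for maximized objective tuples, e.g.
    (rule_compliance, expert_score, listener_score): a dominates b if it
    is at least as good on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))
```

An evolutionary loop would keep the non-dominated candidates of each generation as parents for the next, so no single objective is optimized at the expense of the others.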
arXiv Detail & Related papers (2021-02-16T05:05:54Z)
- SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint [54.012194728496155]
SongMASS is proposed to overcome the challenges of lyric-to-melody generation and melody-to-lyric generation.
It leverages masked sequence to sequence (MASS) pre-training and attention based alignment modeling.
We show that SongMASS generates lyric and melody with significantly better quality than the baseline method.
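The masked sequence-to-sequence (MASS) pre-training mentioned above can be sketched as masking one contiguous span of the input, which the decoder must reconstruct from the encoder's masked view. This is a minimal illustration of the pre-training objective's data preparation, not SongMASS's actual pipeline:

```python
import random

def mass_mask(tokens, mask_ratio=0.5, mask_token="[MASK]"):
    """MASS-style masking: hide a contiguous span of the input sequence.
    The encoder sees the masked sequence; the decoder is trained to
    reconstruct the hidden span."""
    span = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(len(tokens) - span + 1)
    encoder_input = (tokens[:start]
                     + [mask_token] * span
                     + tokens[start + span:])
    decoder_target = tokens[start:start + span]
    return encoder_input, decoder_target
```

Because the target span is contiguous, the decoder learns to generate coherent fragments conditioned on their surrounding context, which suits both lyric-to-melody and melody-to-lyric transfer.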
arXiv Detail & Related papers (2020-12-09T16:56:59Z)
- Melody-Conditioned Lyrics Generation with SeqGANs [81.2302502902865]
We propose an end-to-end melody-conditioned lyrics generation system based on Sequence Generative Adversarial Networks (SeqGAN).
We show that the input conditions have no negative impact on the evaluation metrics while enabling the network to produce more meaningful results.
arXiv Detail & Related papers (2020-10-28T02:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.