Related papers: Anticipatory Music Transformer

Anticipatory Music Transformer

URL: http://arxiv.org/abs/2306.08620v2
Date: Thu, 25 Jul 2024 18:35:33 GMT
Title: Anticipatory Music Transformer
Authors: John Thickstun, David Hall, Chris Donahue, Percy Liang
Abstract summary: We introduce anticipation: a method for constructing a controllable generative model of a temporal point process. We focus on infilling control tasks, whereby the controls are a subset of the events themselves. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset.
Score: 60.15347393822849
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted music generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments with similar musicality to even music composed by humans over a 20-second clip.
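The interleaving described in the abstract can be illustrated with a small sketch. The abstract does not give the exact placement rule, so this example assumes one: a control at time s is inserted right after the first event whose time is at least s - delta, where `delta` is a hypothetical anticipation interval. The function name `interleave` and the tuple layout are also illustrative, not taken from the paper's code.

```python
from typing import List, Tuple

TimedItem = Tuple[float, str]       # (time in seconds, label)
Token = Tuple[str, float, str]      # (kind, time, label)


def interleave(events: List[TimedItem],
               controls: List[TimedItem],
               delta: float) -> List[Token]:
    """Merge events and controls into one sequence.

    Assumed rule (not stated in the abstract): a control at time s is
    placed immediately after the first event with time >= s - delta.
    """
    events = sorted(events)
    controls = sorted(controls)
    out: List[Token] = []
    k = 0  # index of the next control still waiting to be placed
    for t, label in events:
        out.append(("event", t, label))
        # Emit every pending control whose anticipation window has opened.
        while k < len(controls) and t >= controls[k][0] - delta:
            out.append(("control", controls[k][0], controls[k][1]))
            k += 1
    # Controls that no event reached are appended at the end.
    for s, label in controls[k:]:
        out.append(("control", s, label))
    return out


events = [(0.0, "a"), (1.0, "b"), (4.0, "c"), (6.0, "d")]
controls = [(5.0, "x")]
sequence = interleave(events, controls, delta=2.0)
```

With `delta=2.0`, the control at time 5.0 is placed after the event at time 4.0 and before the event at time 6.0, so a left-to-right model would see the control before generating events near it. With `delta=0.0`, the same control would only appear after the event at time 6.0.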

Related papers

Steering Autoregressive Music Generation with Recursive Feature Machines [43.475981527010276]
MusicRFM is a framework that adapts Recursive Feature Machines (RFMs) to enable fine-grained, interpretable control over frozen, pre-trained music models. RFMs analyze a model's internal gradients to produce interpretable "concept directions". We present advanced mechanisms for this control, including dynamic, time-varying schedules and methods for the simultaneous enforcement of multiple musical properties.
arXiv Detail & Related papers (2025-10-21T23:23:14Z)
JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation [75.58351043849385]
Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they unify synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to couple these two processes by simultaneously generating continuous-temporal data and synchronous discrete events. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable models for interactive systems.
arXiv Detail & Related papers (2025-09-26T16:04:00Z)
How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control [28.81619544175742]
Time Series Editing (TSE) makes precise modifications while preserving temporal coherence. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints.
arXiv Detail & Related papers (2025-06-05T17:32:00Z)
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation. It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
Mind the Time: Temporally-Controlled Multi-Event Video Generation [65.05423863685866]
We present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. For the first time in the literature, our model offers control over the timing of events in generated videos.
arXiv Detail & Related papers (2024-12-06T18:52:20Z)
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching. We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features [19.284531698181116]
BandControlNet is designed to handle multiple music sequences and generate high-quality music samples conditioned on the given temporal control features. The proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed. The subjective evaluations show that BandControlNet trained on short datasets can generate music with comparable quality to state-of-the-art models, while significantly outperforming them.
arXiv Detail & Related papers (2024-07-15T06:33:25Z)
MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens. We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders. Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment. Performance conditioning is a tool that directs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances. Our prototype is evaluated using uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls [5.597394612661976]
Polyffusion is a diffusion model that generates polyphonic music scores by regarding music as image-like piano roll representations. We show that by using internal and external controls, Polyffusion unifies a wide range of music creation tasks.
arXiv Detail & Related papers (2023-07-19T06:36:31Z)
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning. Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z)
Conditional Generation of Temporally-ordered Event Sequences [29.44608199294757]
We present a conditional generation model capable of capturing event cooccurrence as well as temporality of event sequences. This single model can address both temporal ordering, sorting a given sequence of events into the order they occurred, and event infilling, predicting new events which fit into a temporally-ordered sequence of existing ones.
arXiv Detail & Related papers (2020-12-31T18:10:18Z)
Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video. The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass. The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.