Continuous descriptor-based control for deep audio synthesis
- URL: http://arxiv.org/abs/2302.13542v1
- Date: Mon, 27 Feb 2023 06:40:11 GMT
- Title: Continuous descriptor-based control for deep audio synthesis
- Authors: Ninon Devis, Nils Demerlé, Sarah Nabi, David Genova, Philippe Esling
- Abstract summary: We introduce a deep generative audio model providing expressive and continuous descriptor-based control.
We enforce the controllability of real-time generation by explicitly removing musical features in the latent space.
We assess the performance of our method on a wide variety of sounds including instrumental, percussive and speech recordings.
- Score: 1.2599533416395767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant advances in deep models for music generation, the use of
these techniques remains restricted to expert users. Before being democratized
among musicians, generative models must first provide expressive control over
the generation, as this conditions the integration of deep generative models in
creative workflows. In this paper, we tackle this issue by introducing a deep
generative audio model providing expressive and continuous descriptor-based
control, while remaining lightweight enough to be embedded in a hardware
synthesizer. We enforce the controllability of real-time generation by
explicitly removing salient musical features in the latent space using an
adversarial confusion criterion. User-specified features are then reintroduced
as additional conditioning information, allowing for continuous control of the
generation, akin to a synthesizer knob. We assess the performance of our method
on a wide variety of sounds including instrumental, percussive and speech
recordings while providing both timbre and attributes transfer, allowing new
ways of generating sounds.
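To make the abstract's mechanism concrete, the following is a minimal, hypothetical Python/PyTorch sketch of its two ingredients: an adversarial confusion criterion that drives descriptor information out of the latent code, and the same descriptor reintroduced as decoder conditioning so it can be swept like a synthesizer knob. All module shapes, descriptor choices, and loss weights here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of descriptor-based control with an adversarial
# confusion criterion (architecture, sizes, and weights are illustrative).
import torch
import torch.nn as nn

LATENT, DESCR = 16, 1  # latent size; one continuous descriptor (e.g. spectral centroid)

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, LATENT))
# The decoder receives the latent code plus the user-specified descriptor,
# so the descriptor acts as a continuous control at generation time.
decoder = nn.Sequential(nn.Linear(LATENT + DESCR, 64), nn.ReLU(), nn.Linear(64, 128))
# The adversary tries to predict the descriptor from the latent code alone.
adversary = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, DESCR))

opt_gen = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(frame, descriptor, confusion_weight=0.1):
    """frame: (B, 128) audio feature frame; descriptor: (B, 1) target descriptor value."""
    # 1) Update the adversary so it predicts the descriptor from z as well as possible.
    z = encoder(frame).detach()
    opt_adv.zero_grad()
    adv_loss = mse(adversary(z), descriptor)
    adv_loss.backward()
    opt_adv.step()

    # 2) Update encoder/decoder: reconstruct the input while *confusing* the
    #    adversary, which pushes descriptor information out of the latent space.
    opt_gen.zero_grad()
    z = encoder(frame)
    recon = decoder(torch.cat([z, descriptor], dim=-1))
    gen_loss = mse(recon, frame) - confusion_weight * mse(adversary(z), descriptor)
    gen_loss.backward()
    opt_gen.step()
    return gen_loss.item(), adv_loss.item()
```

At inference, one would encode any sound and then sweep the descriptor value fed to the decoder to obtain continuous, knob-like control; feeding descriptors taken from a different sound gives a simple form of attribute transfer.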
Related papers
- TADA! Tuning Audio Diffusion Models through Activation Steering [3.563701362999877]
We show that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small subset of attention layers.
Applying Contrastive Activation Addition and Sparse Autoencoders, we can alter specific musical elements with high precision.
arXiv Detail & Related papers (2026-02-12T13:07:14Z) - Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control [66.46754271097555]
We release a fully open-source system for long-form song generation with fine-grained style conditioning.
The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions.
We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens.
arXiv Detail & Related papers (2026-01-07T14:40:48Z) - MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning [18.636738208526676]
MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities.
To enable zero-shot voice cloning, we introduce a classifier injection mechanism that effectively decouples speaker identity from linguistic content.
Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks.
arXiv Detail & Related papers (2026-01-04T15:26:15Z) - Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation [47.38557855930304]
We explore the application of denoising diffusion processes as plug-and-play latent constraints for symbolic music generation models.
We show that diffusion-driven constraints outperform traditional attribute regularization and other latent constraint architectures.
arXiv Detail & Related papers (2025-11-10T14:46:10Z) - JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment [26.590667516155083]
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation.
Recent open lyrics-to-song models have set an acceptable standard in automatic song generation for recreational use.
Our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation.
arXiv Detail & Related papers (2025-07-28T14:34:02Z) - Controllable Video Generation: A Survey [72.38313362192784]
We provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field.
We begin by introducing the key concepts and commonly used open-source video generation models.
We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation.
arXiv Detail & Related papers (2025-07-22T06:05:34Z) - EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models.
Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z) - Fine-Grained control over Music Generation with Activation Steering [0.0]
We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen.
Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using weights of linear probes trained on it (a minimal sketch of this steering idea appears after the list below).
arXiv Detail & Related papers (2025-06-11T23:02:39Z) - Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation.
High-level semantics are conveyed through a cross-attention mechanism.
Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-04-10T09:47:26Z) - Revealing the Implicit Noise-based Imprint of Generative Models [71.94916898756684]
This paper presents a novel framework that leverages noise-based model-specific imprint for the detection task.
By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data.
Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
arXiv Detail & Related papers (2025-03-12T12:04:53Z) - SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation.
It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately.
To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z) - Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation [52.0893266767733]
We propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features.
To enhance the model's robustness to different synthesizer characteristics, we propose a synthesizer feature augmentation strategy.
arXiv Detail & Related papers (2024-11-14T03:57:21Z) - Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z) - MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first singing voice synthesis (SVS) method that enables controlling singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z) - Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment.
Performance conditioning is a tool that instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances.
Our prototype is evaluated on uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z) - Audio Generation with Multiple Conditional Diffusion Model [15.250081484817324]
We propose a novel model that enhances the controllability of existing pre-trained text-to-audio models.
This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio.
arXiv Detail & Related papers (2023-08-23T06:21:46Z) - Anticipatory Music Transformer [60.15347393822849]
We introduce anticipation: a method for constructing a controllable generative model of a temporal point process.
We focus on infilling control tasks, whereby the controls are a subset of the events themselves.
We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset.
arXiv Detail & Related papers (2023-06-14T16:27:53Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model trained for the task of automatic speech recognition, together with extracted melody features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders.
New possibilities for timbre manipulations are enabled with generative neural networks.
arXiv Detail & Related papers (2020-08-04T07:08:04Z) - RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning [69.20460466735852]
This paper presents a deep reinforcement learning algorithm for online accompaniment generation.
The proposed algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part.
arXiv Detail & Related papers (2020-02-08T03:53:52Z)
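The steering mechanism mentioned in the "Fine-Grained control over Music Generation with Activation Steering" entry above can be illustrated with a short sketch: a linear probe is trained on a model's residual-stream activations to predict a musical attribute, and the probe's weight vector is then added back to those activations at generation time. The code below is a hypothetical, simplified Python/PyTorch illustration of that general idea; the hidden size, hook point, attribute labels, and scaling factor are all assumptions rather than details from the paper.

```python
# Hypothetical sketch of activation steering via a linear-probe direction.
# The model, hook point, labels, and scale are illustrative assumptions.
import torch
import torch.nn as nn

HIDDEN = 512  # assumed residual-stream width at the steered layer

# 1) Train a linear probe on cached residual-stream activations to predict a
#    binary musical attribute (e.g. presence of a given timbre or genre).
probe = nn.Linear(HIDDEN, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

def train_probe(activations, labels, epochs=10):
    """activations: (N, HIDDEN) cached vectors; labels: (N, 1) in {0, 1}."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(probe(activations), labels)
        loss.backward()
        optimizer.step()

# 2) At generation time, add the normalized probe direction to the residual
#    stream through a forward hook; the scale sets the steering strength.
def steering_hook(module, inputs, output, scale=4.0):
    direction = probe.weight.detach()[0]
    direction = direction / direction.norm()
    return output + scale * direction  # assumes the hooked output is a plain tensor

# Usage, assuming `layer` is the transformer block whose output is steered:
# handle = layer.register_forward_hook(steering_hook)
# ... run autoregressive generation ...
# handle.remove()
```

Different attributes would use different probes, and a negative scale would push the generation away from the probed attribute.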
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.