LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
- URL: http://arxiv.org/abs/2506.11476v1
- Date: Fri, 13 Jun 2025 05:39:50 GMT
- Title: LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation
- Authors: Tom Baker, Javier Nistal
- Abstract summary: ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-audio diffusion models produce high-quality and diverse music but many, if not most, of the SOTA models lack the fine-grained, time-varying controls essential for music production. ControlNet enables attaching external controls to a pre-trained generative model by cloning and fine-tuning its encoder on new conditionings. However, this approach incurs a large memory footprint and restricts users to a fixed set of controls. We propose a lightweight, modular architecture that considerably reduces parameter count while matching ControlNet in audio quality and condition adherence. Our method offers greater flexibility and significantly lower memory usage, enabling more efficient training and deployment of independent controls. We conduct extensive objective and subjective evaluations and provide numerous audio examples on the accompanying website at https://lightlatentcontrol.github.io
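To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of the ControlNet pattern it describes: the base encoder is frozen, a trainable clone receives the new conditioning, and the clone's features are injected back through zero-initialised projections. The module names, the toy 1-D convolutional encoder, and the tensor shapes are illustrative assumptions, not the LiLAC or ControlNet reference code; the paper's contribution is precisely to replace the full encoder clone with a much smaller, modular branch.

```python
# Minimal, illustrative PyTorch sketch of the ControlNet mechanism described in the
# abstract: clone the base encoder, feed the new conditioning into the clone, and
# inject its features back through zero-initialised projections.
# Module names and shapes are hypothetical; this is not the LiLAC/ControlNet code.
import copy

import torch
import torch.nn as nn


class ZeroConv(nn.Module):
    """1x1 conv initialised to zero so training starts from the unmodified base output."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class ControlNetBranch(nn.Module):
    """Trainable clone of the base encoder plus a zero-initialised injection layer."""

    def __init__(self, base_encoder: nn.Module, channels: int):
        super().__init__()
        self.clone = copy.deepcopy(base_encoder)          # full encoder copy -> large memory cost
        self.cond_in = nn.Conv1d(channels, channels, 1)   # maps the new conditioning into the clone
        self.zero_out = ZeroConv(channels)

    def forward(self, latent: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        features = self.clone(latent + self.cond_in(condition))
        return self.zero_out(features)                    # residual added to the frozen base's features


# Usage sketch: clone first, then freeze the base so only the control branch trains.
base_encoder = nn.Sequential(nn.Conv1d(64, 64, 3, padding=1), nn.GELU())
branch = ControlNetBranch(base_encoder, channels=64)
for p in base_encoder.parameters():
    p.requires_grad_(False)

latent = torch.randn(2, 64, 256)       # e.g. a latent audio representation
condition = torch.randn(2, 64, 256)    # e.g. an encoded time-varying control signal
residual = branch(latent, condition)   # would be added to the base model's hidden states
```

The memory cost of `copy.deepcopy(base_encoder)` is the footprint the abstract refers to: every new control duplicates the full encoder, which is what a lightweight, modular branch avoids.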
Related papers
- MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners [16.90741849156418]
MuseControlLite is designed to fine-tune text-to-music generation models for precise conditioning. Positional embeddings are critical when the condition of interest is a function of time.
arXiv Detail & Related papers (2025-06-23T15:08:03Z) - CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation [69.43106794519193]
We propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions. Our framework reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the barrier to distributing and deploying the model weights (a minimal low-rank adapter sketch follows this list).
arXiv Detail & Related papers (2024-10-12T07:04:32Z) - BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features [19.284531698181116]
BandControlNet is designed to handle multiple music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features.
The proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed.
Subjective evaluations show that BandControlNet trained on short datasets can generate music of comparable quality to state-of-the-art models, while significantly outperforming them.
arXiv Detail & Related papers (2024-07-15T06:33:25Z) - Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
Video serves as a conditional control for a text-to-audio generation model. We employ a well-performing text-to-audio model to consolidate the video control. Our method shows superiority in terms of quality, controllability, and training efficiency.
arXiv Detail & Related papers (2024-07-08T01:59:17Z) - MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z) - Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [62.51232333352754]
Ctrl-Adapter adds diverse controls to any image/video diffusion model through the adaptation of pretrained ControlNets.
With six diverse U-Net/DiT-based image/video diffusion models, Ctrl-Adapter matches the performance of pretrained ControlNets on COCO.
arXiv Detail & Related papers (2024-04-15T17:45:36Z) - DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z) - Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - Content-based Controls For Music Large Language Modeling [6.17674772485321]
Coco-Mulla is a content-based control method for music large language modeling.
It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models.
Our approach achieves high-quality music generation with low-resource semi-supervised learning.
arXiv Detail & Related papers (2023-10-26T05:24:38Z) - Audio Generation with Multiple Conditional Diffusion Model [15.250081484817324]
We propose a novel model that enhances the controllability of existing pre-trained text-to-audio models.
This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio.
arXiv Detail & Related papers (2023-08-23T06:21:46Z) - Anticipatory Music Transformer [60.15347393822849]
We introduce anticipation: a method for constructing a controllable generative model of a temporal point process.
We focus on infilling control tasks, whereby the controls are a subset of the events themselves.
We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset.
arXiv Detail & Related papers (2023-06-14T16:27:53Z)
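As referenced in the CtrLoRA entry above, the parameter savings in that line of work come from low-rank adapters attached to frozen weights. The sketch below is a generic LoRA-style linear layer in PyTorch, offered purely as an illustration under assumed names and sizes, not as the CtrLoRA implementation; the ~90% reduction reported there is a whole-model figure, not a property of this toy layer.

```python
# Generic LoRA-style adapter: a frozen base weight plus a trainable rank-r update.
# Hypothetical names and sizes, shown only to illustrate where the parameter
# savings claimed by adapter-based methods (e.g. CtrLoRA) come from.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                            # base weights stay frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # trainable low-rank "A"
        self.up = nn.Linear(rank, base.out_features, bias=False)    # trainable low-rank "B"
        nn.init.zeros_(self.up.weight)                               # zero-init: starts as the base layer
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
x = torch.randn(4, 1024)
y = layer(x)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")   # ~16k trainable vs ~1.07M total
```

For a 1024x1024 linear layer, the rank-8 update adds about 16k trainable parameters against roughly 1.05M frozen ones, which is the kind of ratio that makes per-condition adapters cheap to train and distribute.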