TADA! Tuning Audio Diffusion Models through Activation Steering
- URL: http://arxiv.org/abs/2602.11910v1
- Date: Thu, 12 Feb 2026 13:07:14 GMT
- Title: TADA! Tuning Audio Diffusion Models through Activation Steering
- Authors: Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja,
- Abstract summary: We show that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small subset of attention layers. Applying Contrastive Activation Addition and Sparse Autoencoders, we can alter specific musical elements with high precision.
- Score: 3.563701362999877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
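To make the steering idea concrete, here is a minimal PyTorch-style sketch of Contrastive Activation Addition applied to a single attention layer of a diffusion denoiser. This is not the paper's released code: the denoiser call signature, the choice of layer, the prompt pair, and the scaling factor `alpha` are illustrative assumptions, and in practice the steering vector would typically be averaged over several prompt pairs and diffusion timesteps.
```python
import torch

def capture_activation(denoiser, layer, latents, cond):
    """Run one denoising forward pass and record the chosen layer's output."""
    cache = {}

    def hook(_module, _inputs, output):
        cache["act"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        denoiser(latents, cond)  # hypothetical call signature of the denoiser
    handle.remove()
    return cache["act"]

def contrastive_steering_vector(denoiser, layer, latents, cond_pos, cond_neg):
    """CAA: difference between activations for a contrastive prompt pair,
    e.g. cond_pos = 'upbeat track with vocals', cond_neg = 'instrumental track'."""
    act_pos = capture_activation(denoiser, layer, latents, cond_pos)
    act_neg = capture_activation(denoiser, layer, latents, cond_neg)
    return act_pos - act_neg

def steer(layer, steering_vec, alpha=1.0):
    """Add the scaled steering vector to the layer's output during sampling.
    Returning a value from a forward hook replaces the module's output."""
    def hook(_module, _inputs, output):
        return output + alpha * steering_vec

    return layer.register_forward_hook(hook)  # call .remove() to stop steering
```
Per the abstract, such steering is applied only in the small subset of attention layers identified via activation patching, which is what keeps the edit localized to the targeted concept (e.g. tempo or mood) rather than altering the whole generation.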
Related papers
- EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross- and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z)
- Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation. High-level semantics are conveyed through a cross-attention mechanism. Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-04-10T09:47:26Z)
- Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing [60.38045088180188]
We propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. We incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies. Our method performs favorably against the state-of-the-art models on two primary benchmarks.
arXiv Detail & Related papers (2025-03-15T08:25:57Z)
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
- FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment [11.796771978828403]
We introduce FolAI, a two-stage generative framework that produces temporally coherent and semantically controllable sound effects from video. Results show that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings.
arXiv Detail & Related papers (2024-12-19T16:37:19Z)
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over attributes such as singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation. Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment.
Performance conditioning is a tool that instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances.
Our prototype is evaluated on uncurated performances with diverse instrumentation, achieving state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
- Continuous descriptor-based control for deep audio synthesis [1.2599533416395767]
We introduce a deep generative audio model providing expressive and continuous descriptor-based control.
We enforce the controllability of real-time generation by explicitly removing musical features in the latent space.
We assess the performance of our method on a wide variety of sounds including instrumental, percussive and speech recordings.
arXiv Detail & Related papers (2023-02-27T06:40:11Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.