Related papers: Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

URL: http://arxiv.org/abs/2510.23802v1
Date: Mon, 27 Oct 2025 19:35:39 GMT
Title: Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Authors: Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
Abstract summary: We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties. This enables both controllable manipulation and analysis of the AI music generation process.
Score: 4.757470067755357
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our work is only done on audio modality, our framework can be extended to interpretable analysis of visual latent space generation models.

Related papers

Sparse Autoencoders Make Audio Foundation Models more Explainable [12.377299061408692]
We use Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models. SAEs retain both information about the original representations and class labels, enabling their internal structure. We show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.
arXiv Detail & Related papers (2025-09-29T13:46:48Z)
SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audios.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure. We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score. The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders. New possibilities for timbre manipulations are enabled with generative neural networks.
arXiv Detail & Related papers (2020-08-04T07:08:04Z)
Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features. We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution. We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.