Evaluating Disentangled Representations for Controllable Music Generation
- URL: http://arxiv.org/abs/2602.10058v2
- Date: Sun, 15 Feb 2026 20:31:32 GMT
- Title: Evaluating Disentangled Representations for Controllable Music Generation
- Authors: Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora
- Abstract summary: We evaluate disentangled representations in music audio models for controllable generation using a probing-based framework. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations.
- Score: 8.177554704838213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
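The probing-based evaluation described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual code: the synthetic embeddings, attribute names, and the linear-probe setup are illustrative assumptions, showing only how probe accuracy can serve as a proxy for informativeness (the embedding encodes a target attribute) and invariance (it does not encode a nuisance attribute).

```python
# Minimal linear-probing sketch (illustrative, not the paper's implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 512-dim "structure" embeddings for 1000 clips, with a
# target attribute (pitch class) and a nuisance attribute (instrument).
emb = rng.normal(size=(1000, 512))
pitch = rng.integers(0, 12, size=1000)       # attribute the embedding should encode
instrument = rng.integers(0, 4, size=1000)   # attribute it should be invariant to

def probe_accuracy(X, y):
    """Accuracy of a linear probe trained to predict attribute y from embeddings X."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# High accuracy on the target attribute suggests informativeness;
# low accuracy on the nuisance attribute suggests invariance.
print("target probe accuracy:  ", probe_accuracy(emb, pitch))
print("nuisance probe accuracy:", probe_accuracy(emb, instrument))
```

With real embeddings, comparing the two probe scores across datasets and controlled transformations is what lets the framework test whether a "structure" embedding actually behaves as intended.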
Related papers
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
- Lost in the Noise: How Reasoning Models Fail with Contextual Distractors [57.31788955167306]
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors.
arXiv Detail & Related papers (2026-01-12T05:43:51Z)
- Introspection in Learned Semantic Scene Graph Localisation [7.222321327403328]
This work investigates how semantics influence localisation performance and robustness in a self-supervised, contrastive semantic localisation framework. We conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. Overall, the results indicate that the model learns noise-robust, semantically salient relations about place definition, thereby enabling explainable registration under challenging visual and structural variations.
arXiv Detail & Related papers (2025-10-08T14:21:45Z)
- Spatial Reasoners for Continuous Variables in Any Domain [49.83744014336816]
We present a framework to perform spatial reasoning over continuous variables with generative denoising models. We provide interfaces to control variable mapping from arbitrary data domains, generative model paradigms, and inference strategies.
arXiv Detail & Related papers (2025-07-14T19:46:54Z)
- A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior [11.859145373647474]
We present the first large-scale benchmarking study designed to provide guidelines for domain shift strategies in seismic interpretation. Our benchmark spans over 200 combinations of model architectures, datasets and training strategies, across three datasets. Our analysis shows that common fine-tuning practices can lead to catastrophic forgetting when source and target datasets are disjoint.
arXiv Detail & Related papers (2025-05-13T13:56:43Z)
- Nonparametric Factor Analysis and Beyond [14.232694150264628]
We propose a general framework for identifying latent variables in settings with non-negligible noise. We show that the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. We have also developed corresponding estimation methods and validated them in various synthetic and real-world settings.
arXiv Detail & Related papers (2025-03-21T05:45:03Z)
- Gaussian Flow Bridges for Audio Domain Transfer with Unpaired Data [20.181313153447412]
This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem.
The presented framework addresses the transport problem across different distributions of audio signals through a series of two deterministic probability flows.
To address the identified challenge of keeping the speech content consistent, we recommend a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise.
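The minibatch Optimal Transport coupling mentioned above can be sketched as follows. This is a generic illustration under assumed shapes, not the paper's implementation: within each minibatch, noise samples are paired with data samples by solving an assignment problem on pairwise squared distances, rather than pairing them at random.

```python
# Sketch of minibatch OT coupling between noise and data (illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
data = rng.normal(size=(16, 8))   # minibatch of data samples (batch x dim)
noise = rng.normal(size=(16, 8))  # matching minibatch of noise samples

# Squared Euclidean cost between every (noise, data) pair.
cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)

# Optimal one-to-one coupling: minimizes total transport cost over pairings.
row, col = linear_sum_assignment(cost)

ot_cost = cost[row, col].sum()
naive_cost = np.trace(cost)  # cost of the arbitrary identity pairing
print("OT pairing cost:   ", ot_cost)
print("naive pairing cost:", naive_cost)
```

Because the assignment minimizes total cost over all permutations, the OT pairing never costs more than an arbitrary pairing, which yields straighter, more consistent flow trajectories during training.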
arXiv Detail & Related papers (2024-05-29T20:23:01Z)
- Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio [17.214062755082065]
Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models.
We show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables.
We propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions.
arXiv Detail & Related papers (2022-05-12T04:11:25Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- Is Disentanglement enough? On Latent Representations for Controllable Music Generation [78.8942067357231]
In the absence of a strong generative decoder, disentanglement does not necessarily imply controllability.
The structure of the latent space with respect to the VAE-decoder plays an important role in boosting the ability of a generative model to manipulate different attributes.
arXiv Detail & Related papers (2021-08-01T18:37:43Z)
- Disentangling Action Sequences: Discovering Correlated Samples [6.179793031975444]
We demonstrate that the data itself, rather than the factors, plays a crucial role in disentanglement, and that the disentangled representations align the latent variables with the action sequences.
We propose a novel framework, fractional variational autoencoder (FVAE) to disentangle the action sequences with different significance step-by-step.
Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
arXiv Detail & Related papers (2020-10-17T07:37:50Z)
- Unsupervised Controllable Generation with Self-Training [90.04287577605723]
Controllable generation with GANs remains a challenging research problem.
We propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training.
Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder.
arXiv Detail & Related papers (2020-07-17T21:50:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.