Related papers: ConceptCaps -- a Distilled Concept Dataset for Interpretability in Music Models

URL: http://arxiv.org/abs/2601.14157v1
Date: Tue, 20 Jan 2026 17:04:08 GMT
Title: ConceptCaps -- a Distilled Concept Dataset for Interpretability in Music Models
Authors: Bruno Sienkiewicz, Łukasz Neumann, Mateusz Modrzejewski
Abstract summary: ConceptCaps is a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. A VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns.
Score: 0.10923877073891443
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
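The abstract's point about TCAV needing well-separated positive and negative examples can be illustrated with a minimal sketch. This is not the authors' code: the helper names, the toy 4-dimensional activations, and the synthetic gradients are all hypothetical, and the concept direction is taken as a difference of class means, whereas TCAV proper fits a linear classifier and uses the normal to its decision boundary.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_activation_vector(pos_acts, neg_acts):
    """Unit vector pointing from the negative-example mean to the positive-example mean.

    Simplification (assumption): TCAV proper trains a linear classifier and
    uses the normal to its decision boundary; the difference of means stands
    in for that here. Assumes the two means are not identical.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def tcav_score(gradients, cav):
    """Fraction of inputs whose class-logit gradient has a positive component along the CAV."""
    positive = sum(
        1 for g in gradients
        if sum(gi * ci for gi, ci in zip(g, cav)) > 0
    )
    return positive / len(gradients)

# Hypothetical toy data: the concept shifts dimension 0 of the activations.
rng = random.Random(0)
neg_acts = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
pos_acts = [
    [rng.gauss(3.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
    for _ in range(200)
]
cav = concept_activation_vector(pos_acts, neg_acts)

# Hypothetical gradients of a class logit with respect to the activations;
# in a real model these would come from backpropagation at the probed layer.
gradients = [
    [rng.gauss(1.0, 0.5)] + [rng.gauss(0.0, 0.5) for _ in range(3)]
    for _ in range(100)
]
score = tcav_score(gradients, cav)
```

When positive and negative examples are noisy or overlap heavily, the two means move closer together and the estimated direction becomes unstable, which is the failure mode that clean per-concept labels are meant to avoid.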

Related papers

Towards Effective Negation Modeling in Joint Audio-Text Models for Music [3.7723788828505125]
Joint audio-text models struggle with semantic phenomena such as negation. We introduce negation through text augmentation and a dissimilarity-based contrastive loss. We propose two protocols that frame negation modeling as retrieval and binary classification tasks.
arXiv Detail & Related papers (2026-01-20T13:06:48Z)
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders [4.757470067755357]
We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties. This enables both controllable manipulation and analysis of the AI music generation process.
arXiv Detail & Related papers (2025-10-27T19:35:39Z)
AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation [16.047087043580053]
Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. Unlike semantic-only text tokens, audio tokens must both capture global semantic content and preserve fine-grained acoustic details. This paper provides suitable definitions for semantic and acoustic tokens and introduces a systematic evaluation framework.
arXiv Detail & Related papers (2025-09-02T14:15:22Z)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning [10.558648773612191]
We propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity from the semantic perspective to the user perspective hierarchically. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training.
arXiv Detail & Related papers (2025-05-29T09:50:07Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders. During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
Evaluation of pretrained language models on music understanding [0.0]
We demonstrate that Large Language Models (LLMs) suffer from 1) prompt sensitivity, 2) inability to model negation, and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
arXiv Detail & Related papers (2024-09-17T14:44:49Z)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities. Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary. By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate open-domain learning. The proposed framework demonstrates strong zero-shot detection performance: on the LVIS dataset, for example, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.