Related papers: Towards Effective Negation Modeling in Joint Audio-Text Models for Music

URL: http://arxiv.org/abs/2601.13931v1
Date: Tue, 20 Jan 2026 13:06:48 GMT
Title: Towards Effective Negation Modeling in Joint Audio-Text Models for Music
Authors: Yannis Vasilakis, Rachel Bittner, Johan Pauwels
Abstract summary: Joint audio-text models struggle with semantic phenomena such as negation. We introduce negation through text augmentation and a dissimilarity-based contrastive loss. We propose two protocols that frame negation modeling as retrieval and binary classification tasks.
Score: 3.7723788828505125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
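The abstract does not give the loss formula or the exact evaluation protocol, so the following is only a minimal sketch of how a dissimilarity term and a binary negation check could look in a CLAP-style setup. Everything here is an assumption made for illustration: the function names (`info_nce`, `negation_dissimilarity`, `total_loss`, `negation_accuracy`), the temperature of 0.07, the clamped-cosine penalty, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def info_nce(audio, text, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    The temperature value is an assumed default, not one reported in the abstract.
    """
    n = len(audio)
    logits = [[cosine(a, t) / temperature for t in text] for a in audio]

    def mean_cross_entropy(rows):
        # The matching pair for row i sits on the diagonal (column i).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))

def negation_dissimilarity(text, negated_text):
    """Hypothetical penalty: mean cosine similarity, clamped at zero, between
    each caption embedding and the embedding of its negated counterpart."""
    pairs = list(zip(text, negated_text))
    return sum(max(0.0, cosine(t, n)) for t, n in pairs) / len(pairs)

def total_loss(audio, text, negated_text, weight=1.0):
    """Contrastive loss plus the assumed dissimilarity term, scaled by `weight`."""
    return info_nce(audio, text) + weight * negation_dissimilarity(text, negated_text)

def negation_accuracy(audio, text, negated_text):
    """Assumed binary protocol: a clip counts as correct when it is more
    similar to its original caption than to the negated caption."""
    triples = list(zip(audio, text, negated_text))
    hits = sum(cosine(a, t) > cosine(a, n) for a, t, n in triples)
    return hits / len(triples)
```

In this sketch the penalty pushes a caption and its negation apart only while their similarity is positive. That is one of several plausible ways to "explicitly separate original and negated captions", and the paper itself should be consulted for the actual formulation.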

Related papers

SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models [17.194017001016135]
We show that the embedding space of Vision-Language Models can be divided into semantically consistent subspaces. We propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point. Our method improves negation understanding by about 30% on average over prior methods.
arXiv Detail & Related papers (2025-11-15T19:18:40Z)
LeVo: High-Quality Song Generation with Multi-Preference Alignment [47.965028296133426]
We introduce LeVo, a language model based framework consisting of LeLM and Music Codec. LeVo is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types.
arXiv Detail & Related papers (2025-06-09T07:57:24Z)
Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning [10.558648773612191]
We propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity from the semantic perspective to the user perspective hierarchically. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training.
arXiv Detail & Related papers (2025-05-29T09:50:07Z)
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
arXiv Detail & Related papers (2025-05-02T13:30:19Z)
Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP. We revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
arXiv Detail & Related papers (2024-12-24T12:51:05Z)
Evaluation of pretrained language models on music understanding [0.0]
We demonstrate that Large Language Models (LLM) suffer from 1) prompt sensitivity, 2) inability to model negation, and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
arXiv Detail & Related papers (2024-09-17T14:44:49Z)
Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process. We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous. Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
Resource-constrained stereo singing voice cancellation [1.0962868591006976]
We study the problem of stereo singing voice cancellation. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial.
arXiv Detail & Related papers (2024-01-22T16:05:30Z)
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types. We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input. In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding. We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z)
Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs. We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.