Estimating Musical Surprisal in Audio
- URL: http://arxiv.org/abs/2501.07474v1
- Date: Mon, 13 Jan 2025 16:46:45 GMT
- Title: Estimating Musical Surprisal in Audio
- Authors: Mathias Rose Bjare, Giorgia Cantisani, Stefan Lattner, Gerhard Widmer
- Abstract summary: We use the information content (IC) of one-step predictions from an autoregressive model as a proxy for musical surprisal, extending an approach previously applied to symbolic music.
We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network.
We investigate the IC's relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, dissonance, rhythmic complexity, and onset density.
- Score: 4.056099795258358
- Abstract: In modeling musical surprisal with computational methods, it has been proposed to use the information content (IC) of one-step predictions from an autoregressive model as a proxy for surprisal in symbolic music. With an appropriately chosen model, the IC of musical events has been shown to correlate with human perception of surprise and complexity aspects, including tonal and rhythmic complexity. This work investigates whether an analogous methodology can be applied to music audio. We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network. We verify learning effects by estimating the decrease in IC with repetitions. We investigate the mean IC of musical segment types (e.g., A or B) and find that segment types appearing later in a piece have a higher IC than earlier ones on average. We investigate the IC's relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, dissonance, rhythmic complexity, and onset density. Finally, we investigate if the IC can predict EEG responses to songs and thus model humans' surprisal in music. We provide code for our method on github.com/sonycslparis/audioic.
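The core quantity throughout is the per-event information content, IC(x_t) = -log p(x_t | x_<t), evaluated under an autoregressive model over discrete latent audio tokens. A minimal sketch of how such ICs can be read off a model's one-step predictions; the `encode_audio` and `model` names in the comments are placeholders, not the paper's exact interfaces:

```python
import math

import torch
import torch.nn.functional as F

def information_content(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-step IC(x_t) = -log2 p(x_t | x_<t) in bits.

    logits:  (T, V) one-step-ahead predictions of an autoregressive model
    targets: (T,)   the latent tokens that actually occurred
    """
    log_probs = F.log_softmax(logits, dim=-1)                     # log p(. | x_<t)
    ic_nats = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return ic_nats / math.log(2)                                  # nats -> bits

# Hypothetical usage (encode_audio / model stand in for the pretrained
# autoencoder's tokenizer and the autoregressive Transformer):
#   tokens = encode_audio(waveform)        # (T,) discrete latent token ids
#   logits = model(tokens[:-1])            # (T-1, V) next-token logits
#   ic = information_content(logits, tokens[1:])
```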
Related papers
- Learning Musical Representations for Music Performance Question Answering [10.912207282129753]
Existing multimodal learning methods are incapable of dealing with fundamental problems within music performances.
Our primary backbone is designed to incorporate multimodal interactions within the context of music data.
Our experiments show state-of-the-art results on the Music AVQA datasets.
arXiv Detail & Related papers (2025-02-10T17:41:57Z) - Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
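A hedged sketch of the chunk-and-aggregate pipeline just described; `encoder` and `classifier` are generic stand-ins for the paper's convolutional feature encoder and LLM-based genre head, and mean-pooling of logits is one plausible aggregation rule, not necessarily theirs:

```python
import numpy as np

def classify_genre(signal: np.ndarray, sr: int, encoder, classifier) -> int:
    """Chunk-wise genre classification with late aggregation.

    `encoder` maps a 20 ms chunk to features; `classifier` maps features
    to per-genre logits. Both are hypothetical stand-ins.
    """
    chunk = int(0.020 * sr)                                      # 20 ms per chunk
    n = len(signal) // chunk
    chunks = signal[: n * chunk].reshape(n, chunk)
    logits = np.stack([classifier(encoder(c)) for c in chunks])  # (n, n_genres)
    return int(logits.mean(axis=0).argmax())                     # aggregate over chunks
```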
arXiv Detail & Related papers (2024-10-10T19:17:56Z) - Controlling Surprisal in Music Generation via Information Content Curve Matching [3.5570874721859016]
We propose a novel method for controlling surprisal in music generation using sequence models.
We define a metric called Instantaneous Information Content (IIC).
The IIC serves as a proxy function for the perceived musical surprisal.
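One plausible reading of an IIC curve is kernel-smoothed event ICs evaluated at arbitrary times; the Gaussian kernel and the restriction to past onsets below are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def iic(t: float, onsets: np.ndarray, ics: np.ndarray, sigma: float = 1.0) -> float:
    """Instantaneous Information Content at time t: Gaussian-weighted
    average of the ICs of events that have already occurred.
    (Kernel choice and normalization are assumptions.)"""
    past = onsets <= t
    if not past.any():
        return 0.0
    w = np.exp(-0.5 * ((t - onsets[past]) / sigma) ** 2)      # kernel weights
    return float((w * ics[past]).sum() / w.sum())
```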
arXiv Detail & Related papers (2024-08-12T09:21:41Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
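To make the measure-alignment idea concrete, here is a toy serialization that emits every track's version of measure m before moving on to measure m+1; the `<|>` synchronization marker and this exact layout are hypothetical, not MuPT's actual SMT-ABC token format:

```python
def smt_abc_interleave(tracks: list[str]) -> str:
    """Toy measure-synchronized serialization of multi-track ABC bodies:
    emit all tracks' versions of measure m before measure m+1, so the
    model never sees misaligned barlines. The '<|>' marker is hypothetical."""
    per_track = [t.strip("|").split("|") for t in tracks]
    n_measures = min(len(ms) for ms in per_track)
    measures = ["<|>".join(track[m] for track in per_track)   # sync all tracks
                for m in range(n_measures)]
    return " | ".join(measures)

# Two toy single-voice tracks, two measures each:
print(smt_abc_interleave(["C D E F|G A B c|", "C,2 E,2|G,2 C2|"]))
```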
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
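A sketch of one such interleaving pattern, the "delay" scheme, where codebook stream k is shifted right by k steps so a single-stage LM can predict one token per codebook at every step while respecting their dependency order; the padding value is an assumption:

```python
import numpy as np

def delay_interleave(codes: np.ndarray, pad: int = -1) -> np.ndarray:
    """'Delay' interleaving over K parallel codebook streams.
    (The padding value is an assumption.)

    codes: (K, T) discrete codes, one row per codebook.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]                            # shift stream k by k
    return out
```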
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
The massive growth of self-supervised learning (SSL) has been witnessed in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
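The alternation can be summarized in a few lines; `random_tokenizer`, `train_ssl_model`, and `distill_tokenizer` are hypothetical callables standing in for the paper's components:

```python
def beats_iterations(audio_data, random_tokenizer, train_ssl_model,
                     distill_tokenizer, n_iters: int = 3):
    """Alternating recipe: a tokenizer labels the audio, an SSL model is
    trained by mask-and-predict on those labels, and a new tokenizer is
    distilled from the model for the next round. All three callables are
    hypothetical stand-ins for the paper's components."""
    tokenizer, model = random_tokenizer, None                 # iter 1: random projection
    for _ in range(n_iters):
        labels = [tokenizer(x) for x in audio_data]           # discrete acoustic tokens
        model = train_ssl_model(audio_data, labels)           # masked prediction SSL
        tokenizer = distill_tokenizer(model)                  # distill semantics
    return model, tokenizer
```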
arXiv Detail & Related papers (2022-12-18T10:41:55Z) - Relating Human Perception of Musicality to Prediction in a Predictive Coding Model [0.8062120534124607]
We explore the use of a neural network inspired by predictive coding for modeling human music perception.
This network was developed based on the computational neuroscience theory of recurrent interactions in the hierarchical visual cortex.
We adapt this network to model the hierarchical auditory system and investigate whether it will make similar choices to humans regarding the musicality of a set of random pitch sequences.
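At its simplest, the predictive-coding intuition is that a stimulus is judged by the errors of one-step predictions; a minimal sketch with a hypothetical predictor `predict` (the paper's actual network is recurrent and hierarchical):

```python
import numpy as np

def prediction_errors(sequence: np.ndarray, predict) -> np.ndarray:
    """One-step prediction errors over a pitch sequence; under the
    predictive-coding view, sequences yielding smaller errors would be
    judged more 'musical'. `predict` is a hypothetical one-step predictor,
    not the paper's recurrent hierarchical network."""
    errors = [abs(sequence[t] - predict(sequence[:t]))        # residual at step t
              for t in range(1, len(sequence))]
    return np.array(errors)
```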
arXiv Detail & Related papers (2022-10-29T12:20:01Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
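Neural codecs of this kind typically discretize each latent frame with residual vector quantization; a greedy nearest-neighbour sketch of that idea (codebook handling here is illustrative, not EnCodec's exact implementation):

```python
import numpy as np

def residual_vq(z: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Greedy residual vector quantization of one latent frame: each
    codebook quantizes what the previous ones left over, and only the
    code indices need to be transmitted.

    z: (D,) latent frame;  codebooks: list of (V, D) arrays.
    """
    residual, codes = z.copy(), []
    for cb in codebooks:
        idx = int(((cb - residual) ** 2).sum(axis=1).argmin())  # nearest code
        codes.append(idx)
        residual -= cb[idx]                                     # quantize the remainder
    return codes
```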
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities [6.832341432995627]
Music emotion recognition is an important task in MIR (Music Information Retrieval) research.
One important step towards better models would be to understand what a model is actually learning from the data.
We show how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction.
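A generic way to obtain such segment-level explanations is occlusion: mask one spectrogram segment at a time and record the drop in the predicted emotion score. The sketch below shows that scheme, which is related to but not identical to the paper's approach via intuitive perceptual qualities:

```python
import numpy as np

def segment_relevance(spec: np.ndarray, segments: np.ndarray, model) -> np.ndarray:
    """Occlusion-style attribution: relevance of each spectrogram segment
    is the drop in the model's emotion score when that segment is masked.
    `model` maps an (F, T) spectrogram to a scalar score (a stand-in here).

    segments: (F, T) integer segment ids, e.g. from a superpixel algorithm.
    """
    base = model(spec)
    scores = []
    for s in np.unique(segments):
        masked = spec.copy()
        masked[segments == s] = spec.min()                    # occlude segment s
        scores.append(base - model(masked))                   # prediction drop
    return np.array(scores)
```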
arXiv Detail & Related papers (2021-06-14T22:49:19Z) - Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
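The core construction, keeping pairwise correlations of hand-crafted feature trajectories rather than the raw features, can be sketched as follows (the upper-triangle flattening is an assumption about the exact layout):

```python
import numpy as np

def correlation_features(frames: np.ndarray) -> np.ndarray:
    """Correlation-based representation: keep the pairwise correlations
    between hand-crafted feature trajectories instead of the raw features,
    giving a compact fixed-size vector.

    frames: (T, D) hand-crafted features over T time frames.
    """
    corr = np.corrcoef(frames.T)                              # (D, D) pairwise correlations
    iu = np.triu_indices_from(corr, k=1)                      # each pair once
    return corr[iu]                                           # length D*(D-1)/2
```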
arXiv Detail & Related papers (2020-03-22T13:34:37Z) - Modeling Musical Structure with Artificial Neural Networks [0.0]
I explore the application of artificial neural networks to different aspects of musical structure modeling.
I show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments.
I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals.
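A minimal sketch of a Gated Autoencoder forward pass, where a mapping code m encodes the transformation relating two fragments; shapes and the nonlinearity are illustrative assumptions:

```python
import numpy as np

def gae_forward(x: np.ndarray, y: np.ndarray,
                U: np.ndarray, V: np.ndarray, W: np.ndarray):
    """Gated Autoencoder forward pass: a mapping code m captures the
    transformation relating fragments x and y, and y is reconstructed
    from x and m.

    x, y: (D,) fragments;  U, V: (F, D) factors;  W: (M, F) mapping weights.
    """
    factors = (U @ x) * (V @ y)                               # multiplicative gating
    m = np.tanh(W @ factors)                                  # mapping code / 'interval'
    y_hat = V.T @ ((U @ x) * (W.T @ m))                       # reconstruct y from x and m
    return m, y_hat
```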
arXiv Detail & Related papers (2020-01-06T18:35:57Z)