Segment Transformer: AI-Generated Music Detection via Music Structural Analysis
- URL: http://arxiv.org/abs/2509.08283v1
- Date: Wed, 10 Sep 2025 04:56:40 GMT
- Title: Segment Transformer: AI-Generated Music Detection via Music Structural Analysis
- Authors: Yumin Kim, Seonghyeon Go
- Abstract summary: We aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models. For long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships.
- Score: 1.7034813545878587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio and music generation systems have advanced remarkably in the music information retrieval (MIR) research field. The advancement of these technologies raises copyright concerns, as ownership and authorship of AI-generated music (AIGM) remain unclear. It can also be difficult to determine clearly whether a piece was generated by AI or composed by humans. To address these challenges, we aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models, including self-supervised learning (SSL) models or an audio effect encoder, each within our suggested transformer-based framework. Furthermore, for long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships. We used the FakeMusicCaps and SONICS datasets, achieving high accuracy in both the short-audio and full-audio detection experiments. These findings suggest that integrating segment-level musical features into long-range temporal analysis can effectively enhance both the performance and robustness of AIGM detection systems.
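The abstract outlines a two-level design: pre-trained encoders embed short segments, and a transformer models the relationships between those segment embeddings. Below is a minimal sketch of that design, assuming placeholder dimensions and a generic encoder standing in for the SSL models; it is not the authors' implementation.

```python
# Hypothetical sketch of a segment-level AIGM detector: split a waveform into
# fixed-length segments, embed each with a (placeholder) pre-trained encoder,
# then model inter-segment relationships with a transformer over the sequence.
import torch
import torch.nn as nn

class SegmentTransformerSketch(nn.Module):
    def __init__(self, seg_dim=768, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in for an SSL model or audio effect encoder (assumption):
        # maps one raw-audio segment to a fixed-size embedding.
        self.segment_encoder = nn.Sequential(
            nn.Linear(16000, seg_dim), nn.ReLU(), nn.Linear(seg_dim, seg_dim)
        )
        self.cls = nn.Parameter(torch.zeros(1, 1, seg_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=seg_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(seg_dim, 2)  # human-composed vs. AI-generated

    def forward(self, segments):               # (batch, n_segments, 16000)
        z = self.segment_encoder(segments)     # (batch, n_segments, seg_dim)
        cls = self.cls.expand(z.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, z], dim=1))
        return self.head(z[:, 0])              # classify from the CLS token

# One 10-second clip at 16 kHz, cut into ten 1-second segments (assumed sizes).
wave = torch.randn(1, 10, 16000)
logits = SegmentTransformerSketch()(wave)
print(logits.shape)  # torch.Size([1, 2])
```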
Related papers
- Fusion Segment Transformer: Bi-Directional Attention Guided Fusion Network for AI-Generated Music Detection [1.7034813545878587]
We propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. We enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer.
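The abstract names a Gated Fusion Layer but does not specify its form; the sketch below shows one common gating formulation (a learned sigmoid mixing two embedding streams), offered purely as an assumption.

```python
# A plausible gated-fusion layer (assumption: the paper's exact gating is not
# specified in the abstract). Two embedding streams are mixed per-dimension
# by a learned sigmoid gate.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):                  # two (batch, dim) embeddings
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b          # convex per-dimension mixture

fused = GatedFusion()(torch.randn(4, 768), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 768])
```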
arXiv Detail & Related papers (2026-01-20T06:31:05Z) - Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval [58.640807985155554]
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. We propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR.
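The summary does not detail how the importance-aware aggregation works; one plausible reading, sketched below under that assumption, scores each modality and takes a softmax-weighted sum of the audio, vision, and text embeddings.

```python
# Hypothetical importance-aware fusion: predict a scalar score per modality,
# softmax the scores, and take the weighted sum of modality embeddings.
import torch
import torch.nn as nn

class ImportanceFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                    # (batch, n_modalities, dim)
        w = torch.softmax(self.score(feats), dim=1)  # importance per modality
        return (w * feats).sum(dim=1)            # (batch, dim)

out = ImportanceFusion()(torch.randn(2, 3, 512))  # audio, vision, text
print(out.shape)  # torch.Size([2, 512])
```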
arXiv Detail & Related papers (2025-08-06T09:58:43Z) - Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset.
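Vision-derived queries fetching audio information is naturally expressed as cross-attention; the sketch below uses standard multi-head attention as a stand-in, with all shapes assumed.

```python
# Sketch of vision-centric querying (assumption: details differ from the
# paper): vision-derived queries attend over audio features via standard
# multi-head cross-attention to fetch the sound-relevant information.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
vision_queries = torch.randn(1, 16, 256)   # queries derived from video frames
audio_feats = torch.randn(1, 50, 256)      # per-frame audio features
fetched, _ = attn(vision_queries, audio_feats, audio_feats)
print(fetched.shape)  # torch.Size([1, 16, 256])
```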
arXiv Detail & Related papers (2025-06-30T08:40:36Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent foundational foley generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Detecting Musical Deepfakes [0.0]
This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio, then used to train and evaluate a convolutional neural network.
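The pipeline described (tempo stretching and pitch shifting for augmentation, mel spectrograms, a CNN classifier) can be sketched directly with librosa and PyTorch; all hyperparameters below are assumptions, not the paper's settings.

```python
# Minimal sketch of the described pipeline: apply tempo/pitch perturbations,
# compute a mel spectrogram, and score it with a small CNN.
import librosa
import numpy as np
import torch
import torch.nn as nn

y, sr = librosa.load(librosa.ex("trumpet"))           # any mono clip works
y = librosa.effects.time_stretch(y, rate=1.1)         # adversarial tempo change
y = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # adversarial pitch change
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

cnn = nn.Sequential(                                  # toy detector head
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 2),                   # real vs. AI-generated
)
x = torch.from_numpy(mel_db).float()[None, None]      # (1, 1, 128, frames)
print(cnn(x).shape)  # torch.Size([1, 2])
```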
arXiv Detail & Related papers (2025-05-03T21:45:13Z) - Estimating Musical Surprisal in Audio [4.056099795258358]
The information content (IC) of one-step predictions from an autoregressive model has been proposed as a proxy for surprisal in symbolic music. We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network. We investigate the IC's relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, with dissonance, rhythmic complexity, and onset density.
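Surprisal here is the standard information content of the observed next step under the model, IC(x_t) = -log p(x_t | x_{<t}); the sketch below computes it for a toy categorical model standing in for the paper's latent-audio Transformer.

```python
# Information content (surprisal) of one-step autoregressive predictions:
# IC(x_t) = -log p(x_t | x_{<t}). A toy categorical model stands in for the
# paper's transformer over compressed latent audio frames (assumption).
import torch
import torch.nn.functional as F

vocab, seq = 32, 10
logits = torch.randn(seq, vocab)          # model outputs for steps 1..seq
targets = torch.randint(vocab, (seq,))    # the latents that actually occurred
log_probs = F.log_softmax(logits, dim=-1)
ic_nats = -log_probs[torch.arange(seq), targets]   # per-step surprisal (nats)
ic_bits = ic_nats / torch.log(torch.tensor(2.0))   # convert to bits
print(ic_bits)  # higher values = more surprising moments
```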
arXiv Detail & Related papers (2025-01-13T16:46:45Z) - Music102: A $D_{12}$-equivariant transformer for chord progression accompaniment [0.0]
Music102 is an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetries, such as transposition and reflection operations, integrating these properties into the transformer architecture.
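The dihedral group $D_{12}$ acts on the 12 pitch classes by transposition and inversion; the snippet below illustrates that group action, which an equivariant model must commute with.

```python
# The dihedral group D12 acts on the 12 pitch classes: transpositions
# T_n(p) = (p + n) mod 12 and reflections (inversions) I_n(p) = (n - p) mod 12.
# Equivariance means the model's output transforms consistently whenever its
# input is transformed by one of these operations.
def transpose(pcs, n):
    return sorted((p + n) % 12 for p in pcs)

def invert(pcs, n=0):
    return sorted((n - p) % 12 for p in pcs)

c_major = [0, 4, 7]                 # C, E, G
print(transpose(c_major, 2))        # [2, 6, 9]  -> D major
print(invert(c_major))              # [0, 5, 8]  -> F minor (inversion about C)
```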
arXiv Detail & Related papers (2024-10-23T03:11:01Z) - Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
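The described flow (20 ms chunks, convolutional feature encoders, per-chunk predictions aggregated at inference) can be sketched as follows; the chunking, encoder shape, and majority-vote aggregation are assumptions.

```python
# Sketch of the described pipeline: cut audio into 20 ms chunks, encode each
# with a convolutional feature encoder, classify per chunk, then aggregate
# the per-chunk predictions by majority vote.
import torch
import torch.nn as nn

sr = 16000
chunk = int(0.020 * sr)                       # 20 ms -> 320 samples
wave = torch.randn(1, sr * 5)                 # 5 s of audio
chunks = wave.unfold(1, chunk, chunk)         # (1, n_chunks, 320)

encoder = nn.Sequential(nn.Conv1d(1, 64, 10, stride=5), nn.GELU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())
head = nn.Linear(64, 10)                      # 10 genres (assumed)

feats = encoder(chunks.reshape(-1, 1, chunk)) # per-chunk features
votes = head(feats).argmax(dim=-1)            # per-chunk genre prediction
genre = votes.mode().values.item()            # majority vote over chunks
print(genre)
```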
arXiv Detail & Related papers (2024-10-10T19:17:56Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
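A sketch of what measure-level synchronization across tracks might look like at the notation level; the toy bar strings and the '&' joining of simultaneous voices are illustrative assumptions, not the SMT-ABC specification.

```python
# Illustration of measure-level synchronization in the spirit of SMT-ABC:
# bars from all tracks are interleaved so that measure k of every track
# appears together, keeping concurrent tracks aligned during generation.
track_1 = "C2 E2 | G2 E2 | C4 |"           # toy ABC-like bars
track_2 = "E,2 G,2 | B,2 G,2 | E,4 |"

def synchronize(*tracks):
    bars = [[b.strip() for b in t.strip("|").split("|")] for t in tracks]
    out = []
    for measure in zip(*bars):             # walk the tracks bar by bar
        out.append(" & ".join(measure))    # '&' joins simultaneous voices
    return " | ".join(out)

print(synchronize(track_1, track_2))
# C2 E2 & E,2 G,2 | G2 E2 & B,2 G,2 | C4 & E,4
```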
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
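One well-known interleaving scheme for parallel codebook streams is a delay pattern, where stream k is offset by k steps so a single-stage LM can emit one token per codebook at each step; the sketch below shows that pattern as an illustration, not necessarily MusicGen's exact layout.

```python
# Sketch of a "delay" interleaving pattern for K parallel codebook streams:
# codebook k is offset by k steps, so one single-stage LM step predicts one
# token per codebook while respecting their dependency order. PAD marks
# positions with no token yet (or no longer).
import numpy as np

K, T, PAD = 4, 6, -1
codes = np.arange(K * T).reshape(K, T)       # (codebooks, timesteps)

delayed = np.full((K, T + K - 1), PAD)
for k in range(K):
    delayed[k, k:k + T] = codes[k]           # shift stream k right by k

print(delayed)
# column t now holds codebook 0 of step t, codebook 1 of step t-1, ...
```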
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - Unsupervised Learning of Deep Features for Music Segmentation [8.528384027684192]
Music segmentation is a problem of identifying boundaries between, and labeling, distinct music segments.
The performance of a range of music segmentation algorithms has been dependent on the audio features chosen to represent the audio.
In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation.
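Once per-frame embeddings are learned, a classic way to turn them into boundaries is Foote's checkerboard-kernel novelty over the self-similarity matrix; the sketch below shows that standard post-processing, which may differ from the paper's.

```python
# Foote-style novelty detection: correlate a checkerboard kernel along the
# diagonal of the embedding self-similarity matrix; peaks in the resulting
# curve suggest segment boundaries.
import numpy as np

def novelty_curve(embeddings, kernel_size=16):
    # cosine self-similarity matrix of per-frame embeddings
    e = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    ssm = e @ e.T
    half = kernel_size // 2
    s = np.sign(np.arange(kernel_size) - half + 0.5)
    kernel = np.outer(s, s)               # +1 on diagonal quadrants, -1 off
    nov = np.zeros(len(ssm))
    for t in range(half, len(ssm) - half):
        nov[t] = (ssm[t - half:t + half, t - half:t + half] * kernel).sum()
    return nov

emb = np.random.rand(200, 64)          # stand-in for learned deep features
print(novelty_curve(emb).shape)        # (200,)
```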
arXiv Detail & Related papers (2021-08-30T01:55:44Z) - Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z) - Modeling Musical Structure with Artificial Neural Networks [0.0]
I explore the application of artificial neural networks to different aspects of musical structure modeling.
I show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments.
I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals.
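A common Gated Autoencoder formulation (assumed here, following the standard multiplicative-interaction design rather than the thesis itself) encodes the transformation between two fragments x and y as a mapping code.

```python
# Sketch of a Gated Autoencoder: multiplicative interactions between two
# musical fragments x and y yield a mapping code m that represents the
# transformation (e.g., transposition interval) between them; y is then
# reconstructed from x and m with tied weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAutoencoder(nn.Module):
    def __init__(self, n_in=12, n_factors=64, n_maps=16):
        super().__init__()
        self.U = nn.Linear(n_in, n_factors, bias=False)
        self.V = nn.Linear(n_in, n_factors, bias=False)
        self.W = nn.Linear(n_factors, n_maps, bias=False)

    def forward(self, x, y):
        m = torch.sigmoid(self.W(self.U(x) * self.V(y)))  # mapping code
        # decode y from x and the mapping code (tied weights)
        y_hat = F.linear(F.linear(m, self.W.weight.t()) * self.U(x),
                         self.V.weight.t())
        return m, y_hat

x, y = torch.rand(8, 12), torch.rand(8, 12)    # e.g., pitch-class vectors
m, y_hat = GatedAutoencoder()(x, y)
print(m.shape, y_hat.shape)  # torch.Size([8, 16]) torch.Size([8, 12])
```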
arXiv Detail & Related papers (2020-01-06T18:35:57Z)