SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
- URL: http://arxiv.org/abs/2408.14080v4
- Date: Tue, 25 Feb 2025 03:22:49 GMT
- Title: SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
- Authors: Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Bishmoy Paul, Shaikh Anowarul Fattah
- Abstract summary: We introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD). We highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection. For long songs, our top-performing variant outperforms ViT by 8% in F1 score, is 38% faster, and uses 26% less memory.
- Score: 0.16777183511743465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in AI-generated songs presents exciting possibilities and challenges. These innovations necessitate the ability to distinguish between human-composed and synthetic songs to safeguard artistic integrity and protect human musical artistry. Existing research and datasets in fake song detection only focus on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, these approaches are inadequate for detecting contemporary end-to-end artificial songs where all components (vocals, music, lyrics, and style) could be AI-generated. Additionally, existing datasets lack music-lyrics diversity, long-duration songs, and open-access fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs (4,751 hours) with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect entirely overlooked in existing methods. To utilize long-range patterns, we introduce SpecTTTra, a novel architecture that significantly improves time and memory efficiency over conventional CNN and Transformer-based models. For long songs, our top-performing variant outperforms ViT by 8% in F1 score, is 38% faster, and uses 26% less memory, while also surpassing ConvNeXt with a 1% F1 score gain, 20% speed boost, and 67% memory reduction.
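To make the long-range efficiency claim above concrete, here is a minimal sketch of one way to tokenize a mel spectrogram into separate temporal and spectral slices before a Transformer encoder, so a long song yields on the order of T + F tokens rather than a dense 2D patch grid. All module names, clip sizes, and dimensions here are illustrative assumptions, not the exact SpecTTTra architecture.

```python
# Illustrative sketch only: slice a mel spectrogram into temporal and spectral
# tokens so a Transformer can cover a long song with relatively few tokens.
import torch
import torch.nn as nn

class SpecTokenizer(nn.Module):
    def __init__(self, n_mels=128, n_frames=2048, t_clip=32, f_clip=8, dim=256):
        super().__init__()
        # A temporal token summarizes t_clip frames across all mel bins;
        # a spectral token summarizes f_clip mel bins across all frames.
        self.t_proj = nn.Linear(n_mels * t_clip, dim)
        self.f_proj = nn.Linear(n_frames * f_clip, dim)
        self.t_clip, self.f_clip = t_clip, f_clip

    def forward(self, spec):                       # spec: (batch, n_mels, n_frames)
        b, f, t = spec.shape
        t_tok = spec.permute(0, 2, 1).reshape(b, t // self.t_clip, -1)
        f_tok = spec.reshape(b, f // self.f_clip, -1)
        return torch.cat([self.t_proj(t_tok), self.f_proj(f_tok)], dim=1)

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
spec = torch.randn(2, 128, 2048)                   # fake batch of mel spectrograms
tokens = SpecTokenizer()(spec)                     # (2, 2048/32 + 128/8, 256) = (2, 80, 256)
pooled = encoder(tokens).mean(dim=1)               # song-level feature for a real/fake head
```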
Related papers
- AI-Generated Song Detection via Lyrics Transcripts [15.1799390517192]
The recent rise in the capabilities of AI-based music generation tools has created an upheaval in the music industry. We propose addressing this gap by transcribing songs with general automatic speech recognition (ASR) models. Our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways; a minimal transcription-then-classification sketch follows this entry.
arXiv Detail & Related papers (2025-06-23T10:42:50Z) - Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion [11.060929679400667]
- Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion [11.060929679400667]
We propose a multimodal, modular late-fusion pipeline that combines automatically transcribed lyrics and speech features capturing lyrics-related information within the audio. Our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations.
arXiv Detail & Related papers (2025-06-19T02:56:49Z) - SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling [0.0]
To the best of our knowledge, there are no open-source, high-quality datasets representing popular and well-known songs for generative music modeling tasks. Our dataset changes this narrative: it is constructed from actual popular music by world-renowned artists.
arXiv Detail & Related papers (2025-06-17T08:08:08Z) - Detecting Musical Deepfakes [0.0]
This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio and then used to train and evaluate a convolutional neural network; a minimal preprocessing sketch follows this entry.
arXiv Detail & Related papers (2025-05-03T21:45:13Z) - Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset [0.29998889086656577]
- Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset [0.29998889086656577]
We show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music.
We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling; a minimal sketch of such a joint loss follows this entry.
arXiv Detail & Related papers (2025-02-10T11:30:35Z) - Detecting Music Performance Errors with Transformers [3.6837762419929168]
- Detecting Music Performance Errors with Transformers [3.6837762419929168]
Existing tools for music error detection rely on automatic alignment.
There is a lack of sufficient data to train music error detection models.
We present a novel data generation technique capable of creating large-scale synthetic music error datasets.
arXiv Detail & Related papers (2025-01-03T07:04:20Z) - Audio Processing using Pattern Recognition for Music Genre Classification [0.0]
This project explores the application of machine learning techniques for music genre classification using the GTZAN dataset.
Motivated by the growing demand for personalized music recommendations, we focused on classifying five genres: Blues, Classical, Jazz, Hip Hop, and Country.
The ANN model demonstrated the best performance, achieving a validation accuracy of 92.44%.
arXiv Detail & Related papers (2024-10-19T05:44:05Z) - SongCreator: Lyrics-based Universal Song Generation [53.248473603201916]
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics.
The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for the DSLM.
Experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.
arXiv Detail & Related papers (2024-09-09T19:37:07Z) - Synthetic Lyrics Detection Across Languages and Genres [4.987546582439803]
The use of large language models (LLMs) to generate music content, particularly lyrics, has gained popularity.
Previous research has explored content detection in various domains, but no work has focused on lyrics, the text modality of music.
We curated a diverse dataset of real and synthetic lyrics from multiple languages, music genres, and artists.
We performed a thorough evaluation of existing synthetic text detection approaches on lyrics, a previously unexplored data type.
Following both music and industrial constraints, we examined how well these approaches generalize across languages, scale with data availability, handle multilingual content, and perform on novel genres in few-shot settings.
arXiv Detail & Related papers (2024-06-21T15:19:21Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standardized assessment of the representations from all open-source pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - GETMusic: Generating Any Music Tracks with a Unified Representation and
Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes them in a 2D structure, with tracks stacked vertically and time progressing horizontally; a minimal sketch of such a grid follows this entry.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
arXiv Detail & Related papers (2023-05-18T09:53:23Z) - RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z) - An Analysis of Classification Approaches for Hit Song Prediction using
Engineered Metadata Features with Lyrics and Audio Features [5.871032585001082]
This study aims to improve the prediction of top-10 hits among Billboard Hot 100 songs by using additional, alternative metadata.
Five machine learning approaches are applied, including: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron.
Our results show that Random Forest (RF) and Logistic Regression (LR) with all features outperform the other models, achieving 89.1% and 87.2% accuracy and 0.91 and 0.93 AUC, respectively; a minimal sketch of this model comparison follows this entry.
arXiv Detail & Related papers (2023-01-31T09:48:53Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music
Generation Task [86.72661027591394]
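A minimal scikit-learn sketch of the five-model comparison described above; feature engineering is out of scope here, so X and y are assumed to be a prepared feature matrix and binary top-10-hit labels, and the hyperparameters are illustrative.

```python
# Illustrative sketch: compare five standard classifiers with cross-validated
# accuracy and AUC on engineered metadata/lyrics/audio features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def compare_models(X, y):
    models = {
        "k-nearest neighbours": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multilayer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: accuracy={acc:.3f}, AUC={auc:.3f}")
```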
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - Museformer: Transformer with Fine- and Coarse-Grained Attention for
Music Generation [138.74751744348274]
We propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation.
Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures.
With the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the computational cost.
arXiv Detail & Related papers (2022-10-19T07:31:56Z) - SongDriver: Real-time Music Accompaniment Generation without Logical
Latency nor Exposure Bias [15.7153621508319]
SongDriver is a real-time music accompaniment generation system without logical latency or exposure bias.
We train SongDriver on some open-source datasets and an original aiSong dataset built from Chinese-style modern pop music scores.
The results show that SongDriver outperforms existing SOTA (state-of-the-art) models on both objective and subjective metrics.
arXiv Detail & Related papers (2022-09-13T15:05:27Z) - Musika! Fast Infinite Waveform Music Generation [0.0]
We introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU.
We achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders.
A latent coordinate system enables generating arbitrarily long sequences of excerpts in parallel, while a global context vector allows the music to remain stylistically coherent through time.
arXiv Detail & Related papers (2022-08-18T08:31:15Z) - Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.