SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
- URL: http://arxiv.org/abs/2408.14080v3
- Date: Sun, 6 Oct 2024 04:03:35 GMT
- Title: SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
- Authors: Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Bishmoy Paul, Shaikh Anowarul Fattah
- Abstract summary: We introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD).
We highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection.
In particular, for long audio samples, our top-performing variant outperforms ViT by 8% in F1 score while being 38% faster and using 26% less memory.
- Score: 0.16777183511743465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent surge in AI-generated songs presents exciting possibilities and challenges. While these inventions democratize music creation, they also necessitate the ability to distinguish between human-composed and synthetic songs to safeguard artistic integrity and protect human musical artistry. Existing research and datasets in fake song detection focus only on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, these approaches are inadequate for detecting contemporary end-to-end artificial songs where all components (vocals, music, lyrics, and style) could be AI-generated. Additionally, existing datasets lack music-lyrics diversity, long-duration songs, and open-access fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs (4,751 hours) with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect entirely overlooked in existing methods. To utilize long-range patterns, we introduce SpecTTTra, a novel architecture that significantly improves time and memory efficiency over conventional CNN and Transformer-based models. In particular, for long audio samples, our top-performing variant outperforms ViT by 8% in F1 score while being 38% faster and using 26% less memory. Additionally, in comparison with ConvNeXt, our model achieves a 1% gain in F1 score with a 20% boost in speed and a 67% reduction in memory usage. Other variants of our model family provide even better speed and memory efficiency with competitive performance.
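The abstract describes SpecTTTra only at a high level, so the following is a minimal, hypothetical sketch of the underlying idea: tokenize the time axis and the frequency axis of a long spectrogram separately, so the token count grows additively (T/t + F/f) rather than multiplicatively as in ViT-style patchification, which is what makes long-range temporal modeling affordable. This is not the authors' implementation; the module names, clip sizes, pooling, and classification head below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the spectro-temporal tokenization idea
# described in the abstract: instead of attending over every time-frequency patch of
# a long spectrogram (as a ViT would), tokenize the time axis and the frequency axis
# separately so the token count grows additively rather than multiplicatively.
# All module names, sizes, and the pooling/classifier head are illustrative assumptions.
import torch
import torch.nn as nn


class SpectroTemporalTokenizer(nn.Module):
    """Slices a (batch, n_mels, n_frames) spectrogram into temporal and spectral tokens."""

    def __init__(self, n_mels=128, n_frames=3840, t_clip=20, f_clip=8, dim=256):
        super().__init__()
        assert n_frames % t_clip == 0 and n_mels % f_clip == 0
        # One temporal token per chunk of t_clip frames (covering the full mel range).
        self.temporal_proj = nn.Linear(n_mels * t_clip, dim)
        # One spectral token per chunk of f_clip mel bins (covering the full duration).
        self.spectral_proj = nn.Linear(n_frames * f_clip, dim)
        self.t_clip, self.f_clip = t_clip, f_clip

    def forward(self, spec):                       # spec: (B, n_mels, n_frames)
        B, F, T = spec.shape
        t_tokens = spec.transpose(1, 2).reshape(B, T // self.t_clip, self.t_clip * F)
        f_tokens = spec.reshape(B, F // self.f_clip, self.f_clip * T)
        tokens = torch.cat([self.temporal_proj(t_tokens),
                            self.spectral_proj(f_tokens)], dim=1)
        return tokens                              # (B, T/t_clip + F/f_clip, dim)


class SyntheticSongDetector(nn.Module):
    """Transformer encoder over spectro-temporal tokens with a binary (real/fake) head."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.tokenizer = SpectroTemporalTokenizer(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)              # logit: synthetic vs. human-made

    def forward(self, spec):
        x = self.encoder(self.tokenizer(spec))
        return self.head(x.mean(dim=1)).squeeze(-1)


if __name__ == "__main__":
    model = SyntheticSongDetector()
    spec = torch.randn(2, 128, 3840)               # ~2 min of audio at ~32 frames/s (assumed)
    print(model(spec).shape)                       # torch.Size([2])
```

With these illustrative sizes, a 128×3840 log-mel spectrogram yields 192 + 16 = 208 tokens, compared with the roughly 1,920 patches a 16×16 ViT-style patchification of the same input would produce, which is the kind of saving the abstract's speed and memory comparisons point to.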
Related papers
- Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset [0.29998889086656577]
We show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music.
We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling.
arXiv Detail & Related papers (2025-02-10T11:30:35Z)
- Detecting Music Performance Errors with Transformers [3.6837762419929168]
Existing tools for music error detection rely on automatic alignment.
There is a lack of sufficient data to train music error detection models.
We present a novel data generation technique capable of creating large-scale synthetic music error datasets.
arXiv Detail & Related papers (2025-01-03T07:04:20Z)
- SongCreator: Lyrics-based Universal Song Generation [53.248473603201916]
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics.
The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for DSLM.
Experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.
arXiv Detail & Related papers (2024-09-09T19:37:07Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- An Analysis of Classification Approaches for Hit Song Prediction using Engineered Metadata Features with Lyrics and Audio Features [5.871032585001082]
This study aims to improve the prediction result of the top 10 hits among Billboard Hot 100 songs using more alternative metadata.
Five machine learning approaches are applied: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression, and Multilayer Perceptron.
Our results show that Random Forest (RF) and Logistic Regression (LR) with all features outperform the other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively.
arXiv Detail & Related papers (2023-01-31T09:48:53Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation [138.74751744348274]
We propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation.
Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures.
With the coarse-grained attention, a token attends only to a summarization of the other bars rather than to each of their tokens, which reduces the computational cost.
arXiv Detail & Related papers (2022-10-19T07:31:56Z)
- SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias [15.7153621508319]
SongDriver is a real-time music accompaniment generation system without logical latency or exposure bias.
We train SongDriver on some open-source datasets and an original aiSong dataset built from Chinese-style modern pop music scores.
The results show that SongDriver outperforms existing SOTA (state-of-the-art) models on both objective and subjective metrics.
arXiv Detail & Related papers (2022-09-13T15:05:27Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.