Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models
- URL: http://arxiv.org/abs/2409.14131v1
- Date: Sat, 21 Sep 2024 12:50:53 GMT
- Title: Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models
- Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna
- Abstract summary: We investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD).
Speaker recognition SFM representations perform the best amongst all the foundation models (FMs).
- Score: 12.034908403186202
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained for general speech representation learning as well as speaker recognition). We show that speaker recognition SFM representations perform the best amongst all the foundation models (FMs), and this performance can be attributed to their higher efficacy in capturing characteristics such as pitch, tone, and intensity present in singing voices. We further explore the fusion of FMs to exploit their complementary behavior for improved SVDD, proposing a novel framework, FIONA, for this purpose. With FIONA, through the synchronization of x-vector (a speaker recognition SFM) and MERT-v1-330M (an MFM), we report the best performance with the lowest Equal Error Rate (EER) of 13.74%, beating all the individual FMs as well as baseline FM fusions and achieving SOTA results.
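The abstract does not spell out FIONA's "synchronization" mechanism, so the sketch below is only a minimal illustration: it assumes plain concatenation of precomputed x-vector and MERT-v1-330M embeddings followed by a small classifier, plus the standard EER computation used to score SVDD systems. The embedding dimensions, names, and toy data are all assumptions, not the authors' implementation.
```python
# Minimal sketch (not the authors' FIONA implementation): fuse precomputed
# x-vector and MERT embeddings by concatenation and score with the standard
# Equal Error Rate (EER) metric. Dimensions and names are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_curve

XVEC_DIM, MERT_DIM = 512, 1024  # assumed embedding sizes

class FusionSVDD(nn.Module):
    """Concatenation-fusion classifier over two foundation-model embeddings."""
    def __init__(self, d_speech=XVEC_DIM, d_music=MERT_DIM, d_hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_speech + d_music, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),  # logit: bonafide vs. deepfake
        )

    def forward(self, xvec, mert):
        return self.head(torch.cat([xvec, mert], dim=-1)).squeeze(-1)

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with random tensors standing in for real FM embeddings.
model = FusionSVDD()
xvec = torch.randn(8, XVEC_DIM)
mert = torch.randn(8, MERT_DIM)
scores = model(xvec, mert).detach().numpy()
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print(f"EER: {equal_error_rate(labels, scores):.4f}")
```
Since the paper describes "synchronization" rather than simple concatenation, treat this only as a baseline-style fusion example.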
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- On the Evaluation of Speech Foundation Models for Spoken Language Understanding [87.52911510306011]
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking.
The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFMs) for these SLU tasks.
We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
arXiv Detail & Related papers (2024-06-14T14:37:52Z)
- Development of a Reliable and Accessible Caregiving Language Model (CaLM) [1.1487735059279973]
This study aimed to develop a reliable Caregiving Language Model (CaLM) by using FMs and a caregiving knowledge base.
We developed CaLM using the Retrieval Augmented Generation (RAG) framework combined with FM fine-tuning for improving the quality of FM answers.
The study shows that reliable and accessible CaLM can be developed by using small FMs with a knowledge base specific to the caregiving domain.
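As a rough illustration of the RAG pattern CaLM builds on (not the authors' system), the sketch below retrieves the most relevant knowledge-base snippets by TF-IDF similarity and prepends them to the prompt handed to a fine-tuned FM; the knowledge base, query, and prompt template are invented for the example.
```python
# Hedged sketch of the Retrieval Augmented Generation (RAG) pattern that CaLM
# builds on -- not the authors' implementation. The knowledge base, query,
# and prompt template are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical caregiving knowledge base.
knowledge_base = [
    "Reposition bed-bound patients every two hours to prevent pressure ulcers.",
    "Keep a written medication schedule and check it at every administration.",
    "Short, regular respite breaks reduce caregiver burnout.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine)."""
    vectorizer = TfidfVectorizer().fit(docs + [query])
    doc_vecs = vectorizer.transform(docs)
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, doc_vecs)[0]
    top = sims.argsort()[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query: str) -> str:
    """Ground the FM's answer in retrieved caregiving snippets."""
    context = "\n".join(retrieve(query, knowledge_base))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How often should I turn an immobile patient?"))
# The resulting prompt would be passed to the (fine-tuned) foundation model.
```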
arXiv Detail & Related papers (2024-03-11T16:12:34Z)
- Speech foundation models on intelligibility prediction for hearing-impaired listeners [4.742307809368852]
Speech foundation models (SFMs) have been benchmarked on many speech processing tasks.
We present a systematic evaluation of 10 SFMs on one such application: speech intelligibility prediction.
We propose a simple method that learns a specialized prediction head on top of frozen SFMs to approach the problem.
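The "specialized prediction head on top of frozen SFMs" recipe is easy to picture; below is a minimal PyTorch sketch with a stand-in frozen encoder and a small regression head trained to predict an intelligibility score. The encoder, dimensions, and data are placeholders, not the paper's models.
```python
# Minimal sketch of the frozen-SFM + prediction-head recipe; the encoder here
# is a random stand-in for a real speech foundation model.
import torch
import torch.nn as nn

FEAT_DIM = 768  # assumed SFM feature size (typical of base-sized transformers)

encoder = nn.Linear(160, FEAT_DIM)           # placeholder for a frozen SFM
for p in encoder.parameters():
    p.requires_grad = False                  # freeze the backbone

head = nn.Sequential(                        # only the head is trained
    nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy batch: 8 "utterances" with intelligibility targets in [0, 1].
x = torch.randn(8, 160)
y = torch.rand(8, 1)

for _ in range(10):                          # a few illustrative training steps
    with torch.no_grad():
        feats = encoder(x)                   # frozen features
    pred = head(feats)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final MSE: {loss.item():.4f}")
```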
arXiv Detail & Related papers (2024-01-24T18:26:52Z)
- Learn From Model Beyond Fine-Tuning: A Survey [78.80920533793595]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FM) based on the model interface.
The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta learning and model editing.
This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z)
- VideoGLUE: Video General Understanding Evaluation of Foundation Models [89.07145427268948]
We evaluate video understanding capabilities of foundation models (FMs) using a carefully designed experiment protocol.
We jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks.
arXiv Detail & Related papers (2023-07-06T17:47:52Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
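A compressed sketch of the MLM-style acoustic pre-training idea: a teacher assigns each frame a pseudo-label, frames are masked, and the student is trained to predict the teacher's labels at the masked positions. Here the "teacher" is just a frozen random projection followed by argmax, a toy stand-in for MERT's actual acoustic and musical teachers.
```python
# Toy sketch of MLM-style acoustic pre-training with teacher pseudo-labels,
# in the spirit of MERT; the "teacher" is a frozen random projection standing
# in for MERT's real teacher models.
import torch
import torch.nn as nn

T, D, K = 50, 64, 32           # frames, feature dim, pseudo-label vocabulary

teacher = nn.Linear(D, K)      # frozen pseudo-labeller (stand-in)
for p in teacher.parameters():
    p.requires_grad = False

student = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, K))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

frames = torch.randn(T, D)                     # toy acoustic frames
with torch.no_grad():
    targets = teacher(frames).argmax(dim=-1)   # teacher pseudo-labels

mask = torch.rand(T) < 0.3                     # mask ~30% of frames
masked_frames = frames.clone()
masked_frames[mask] = 0.0                      # simple zero-out masking

logits = student(masked_frames)                # predict labels at masked spots
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
loss.backward()
optimizer.step()
print(f"masked-prediction loss: {loss.item():.4f}")
```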
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
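The proximal-initialization attack itself is diffusion-specific, but the underlying membership-inference logic is generic: compute a per-example statistic from model queries and threshold it. The sketch below shows only that generic thresholding step on synthetic "reconstruction errors", not the paper's actual attack.
```python
# Generic membership-inference-by-thresholding sketch -- not the paper's
# proximal-initialization attack. Training-set members tend to have lower
# per-example error, so a threshold on that statistic separates them.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
member_err = rng.normal(loc=0.8, scale=0.3, size=500)     # synthetic errors
nonmember_err = rng.normal(loc=1.2, scale=0.3, size=500)

errors = np.concatenate([member_err, nonmember_err])
is_member = np.concatenate([np.ones(500), np.zeros(500)])

# Lower error => more likely a member, so negate the error as the attack score.
auc = roc_auc_score(is_member, -errors)
print(f"attack AUC: {auc:.3f}")   # ~0.8 on this synthetic separation
```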
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- Comparison of Speech Representations for the MOS Prediction System [1.2949520455740093]
We conduct experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges.
We find that the wav2vec feature set showed the best generalization even though the given ground-truth was not always reliable.
arXiv Detail & Related papers (2022-06-28T08:18:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.