Related papers: Faked Speech Detection with Zero Prior Knowledge

URL: http://arxiv.org/abs/2209.12573v6
Date: Tue, 2 Apr 2024 07:58:15 GMT
Title: Faked Speech Detection with Zero Prior Knowledge
Authors: Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen, Baptiste Magnier
Abstract summary: We introduce a neural network method to develop a classifier that will blindly classify an input audio as real or mimicked. We propose a deep neural network following a sequential model that comprises three hidden layers, with alternating dense and drop out layers. We were able to get at least 94% correct classification of the test cases, as against the 85% accuracy in the case of human observers.
Score: 2.407976495888858
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio is one of the most used ways of human communication, but at the same time it can be easily misused to trick people. With the revolution of AI, the related technologies are now accessible to almost everyone, thus making it simple for the criminals to commit crimes and forgeries. In this work, we introduce a neural network method to develop a classifier that will blindly classify an input audio as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without references or real sources. We propose a deep neural network following a sequential model that comprises three hidden layers, with alternating dense and drop out layers. The proposed model was trained on a set of 26 important features extracted from a large dataset of audios to get a classifier that was tested on the same set of features from different audios. The data was extracted from two raw datasets, especially composed for this work; an all English dataset and a mixed dataset (Arabic plus English) (The dataset can be provided, in raw form, by writing an email to the first author). For the purpose of comparison, the audios were also classified through human inspection with the subjects being the native speakers. The ensued results were interesting and exhibited formidable accuracy, as we were able to get at least 94% correct classification of the test cases, as against the 85% accuracy in the case of human observers.
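
The architecture described in the abstract (26 input features, three hidden layers, alternating dense and dropout layers, a real-versus-mimicked decision) can be illustrated with a minimal sketch. This is not the authors' implementation. The layer widths, dropout rate, ReLU and sigmoid activations, the reading of "three hidden layers" as three dense layers each followed by dropout, and the convention that an output of 0.5 or more means "mimicked" are all assumptions made for illustration. The weights are random and untrained, so the printed label carries no meaning.

```python
import math
import random

N_FEATURES = 26              # from the abstract: 26 extracted audio features
HIDDEN_SIZES = [64, 32, 16]  # assumption: widths are not given in the abstract
DROPOUT_RATE = 0.2           # assumption: rate is not given in the abstract


def init_dense(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: active only in training, identity at inference."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_model(rng):
    """Three dense hidden layers plus a single-unit output layer."""
    sizes = [N_FEATURES] + HIDDEN_SIZES
    hidden = [init_dense(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    output = init_dense(sizes[-1], 1, rng)
    return hidden, output


def predict(model, features, rng=None, training=False):
    """Return a score in (0, 1); here 1 is assumed to mean 'mimicked'."""
    hidden, output = model
    x = list(features)
    for layer in hidden:
        x = relu(dense(x, layer))                    # dense layer
        x = dropout(x, DROPOUT_RATE, rng, training)  # followed by dropout
    return sigmoid(dense(x, output)[0])


rng = random.Random(0)
model = build_model(rng)
clip = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # stand-in feature vector
p = predict(model, clip)
print("mimicked" if p >= 0.5 else "real", round(p, 3))
```

In practice the same stack would typically be built and trained with a deep learning library; the plain-Python version only makes the dense/dropout alternation explicit.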

Related papers

Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions. To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z)
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
Learning Audio Concepts from Counterfactual Natural Language [34.118579918018725]
This study introduces causal reasoning and counterfactual analysis in the audio domain. Our model considers acoustic characteristics and sound source information from human-annotated reference texts. Specifically, the top-1 accuracy in open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z)
Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting. When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances. Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
Audio Deepfake Attribution: An Initial Dataset and Investigation [41.62487394875349]
We design the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA). We propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA). Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios.
arXiv Detail & Related papers (2022-08-21T05:15:40Z)
Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity. Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS). LASS aims to separate a target source from an audio mixture based on a natural language query of the target source. We propose LASS-Net, an end-to-end neural network that is learned to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels. Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
Few Shot Text-Independent speaker verification using 3D-CNN [0.0]
We propose a novel method to verify the identity of a claimed speaker using very little training data. Experiments conducted on the VoxCeleb1 dataset show that the proposed model's accuracy, even when trained with very little data, is close to that of the state-of-the-art model for text-independent speaker verification.
arXiv Detail & Related papers (2020-08-25T15:03:29Z)
Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds upon key ideas to build perception of touch sounds without access to any ground-truth data. We show how we can leverage ideas from classical signal processing to get large amounts of data of any sound of interest with a high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark [12.034688724153044]
This paper explores post-hoc explanations for deep neural networks in the audio domain. We present a novel Open Source audio dataset consisting of 30,000 audio samples of English spoken digits. We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.