Related papers: Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms

URL: http://arxiv.org/abs/2403.11778v1
Date: Mon, 18 Mar 2024 13:35:10 GMT
Title: Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms
Authors: Jonat John Mathew, Rakin Ahsan, Sae Furukawa, Jagdish Gautham Krishna Kumar, Huzaifa Pallan, Agamjeet Singh Padda, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan
Abstract summary: Deepfake audio poses a rising threat in communication platforms, necessitating real-time detection for audio stream integrity. This study assesses the viability of employing static deepfake audio detection models in real-time communication platforms. Two deepfake audio detection models based on Resnet and LCNN architectures are implemented.
Score: 0.5850093728139567
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deepfake audio poses a rising threat in communication platforms, necessitating real-time detection for audio stream integrity. Unlike traditional non-real-time approaches, this study assesses the viability of employing static deepfake audio detection models in real-time communication platforms. An executable software is developed for cross-platform compatibility, enabling real-time execution. Two deepfake audio detection models based on Resnet and LCNN architectures are implemented using the ASVspoof 2019 dataset, achieving benchmark performances compared to ASVspoof 2019 challenge baselines. The study proposes strategies and frameworks for enhancing these models, paving the way for real-time deepfake audio detection in communication platforms. This work contributes to the advancement of audio stream security, ensuring robust detection capabilities in dynamic, real-time communication scenarios.

Related papers

Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries [23.83866791274789]
We propose a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors from text or audio prompts. DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting.
arXiv Detail & Related papers (2025-07-22T08:24:01Z)
Real-Time Emergency Vehicle Siren Detection with Efficient CNNs on Embedded Hardware [0.26249027950824516]
We present a full-stack emergency vehicle siren detection system designed for real-time deployment on embedded hardware. The proposed approach is based on E2PANNs, a fine-tuned convolutional neural network derived from EPANNs. A remote WebSocket interface provides real-time monitoring and facilitates live demonstration capabilities.
arXiv Detail & Related papers (2025-07-02T10:27:41Z)
From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection [0.26249027950824516]
This work introduces E2PANNs (Efficient Emergency Pre-trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture for binary EV siren detection. We fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. Results demonstrate that E2PANNs establish a new state of the art in this research domain, with high computational efficiency and suitability for edge-based audio monitoring and safety-critical applications.
arXiv Detail & Related papers (2025-06-30T00:21:07Z)
End-to-end Audio Deepfake Detection from RAW Waveforms: a RawNet-Based Approach with Cross-Dataset Evaluation [8.11594945165255]
We propose an end-to-end deep learning framework for audio deepfake detection that operates directly on raw waveforms. Our model, RawNetLite, is a lightweight convolutional-recurrent architecture designed to capture both spectral and temporal features without handcrafted preprocessing.
arXiv Detail & Related papers (2025-04-29T16:38:23Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant [11.136112399898481]
We introduce the ETD dataset, the first public dataset for end-turn detection. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low.
arXiv Detail & Related papers (2025-03-30T13:34:23Z)
Efficient Streaming Voice Steganalysis in Challenging Detection Scenarios [13.049308869863248]
This paper introduces a Dual-View VoIP Steganalysis Framework (DVSF). The framework randomly obfuscates parts of the native steganographic descriptors in VoIP stream segments. It then captures fine-grained local features related to steganography, building on the global features of VoIP.
arXiv Detail & Related papers (2024-11-20T02:22:58Z)
DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization [13.840950434728533]
We present a novel audio-visual deepfake detection framework based on the assumption that in real samples, in contrast to deepfakes, visual and audio signals coincide in terms of information. We use features from deep networks that specialize in video and audio speech recognition to spot frame-level cross-modal incongruities.
arXiv Detail & Related papers (2024-11-15T13:47:33Z)
STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking [8.238662377845142]
We present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
arXiv Detail & Related papers (2024-10-08T12:15:17Z)
SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level. AudioSeal achieves state-of-the-art performance in terms of robustness to real life audio manipulations and imperceptibility based on automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side. To protect privacy, audio features are sent to the cloud instead of raw audio.
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
End-To-End Audiovisual Feature Fusion for Active Speaker Detection [7.631698269792165]
This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients features extracted from the audio waveform. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
arXiv Detail & Related papers (2022-07-27T10:25:59Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.