HM-Conformer: A Conformer-based audio deepfake detection system with
hierarchical pooling and multi-level classification token aggregation methods
- URL: http://arxiv.org/abs/2309.08208v1
- Date: Fri, 15 Sep 2023 07:18:30 GMT
- Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, and
Ha-Jin Yu
- Abstract summary: HM-Conformer adapts the Conformer, which was originally designed for sequence-to-sequence tasks, to audio deepfake detection.
It can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them.
In experiments on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.
- Score: 34.83806360076228
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Audio deepfake detection (ADD) is the task of detecting spoofing attacks
generated by text-to-speech or voice conversion systems. Spoofing evidence,
which helps to distinguish between spoofed and bona-fide utterances, might
exist either locally or globally in the input features. To capture these, the
Conformer, which consists of Transformers and CNN, possesses a suitable
structure. However, since the Conformer was designed for sequence-to-sequence
tasks, its direct application to ADD tasks may be sub-optimal. To tackle this
limitation, we propose HM-Conformer by adopting two components: (1) a
hierarchical pooling method that progressively reduces the sequence length to
eliminate duplicated information, and (2) a multi-level classification token
aggregation method that uses classification tokens to gather information from
different blocks. Owing to these components, HM-Conformer can efficiently
detect spoofing evidence by processing various sequence lengths and aggregating
them. In experimental results on the ASVspoof 2021 Deepfake dataset,
HM-Conformer achieved a 15.71% EER, showing competitive performance compared to
recent systems.
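The two components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `toy_block` is a hypothetical stand-in for a Conformer block, and the pooling stride of 2, the number of blocks, and the use of concatenation to aggregate the per-level classification tokens are all assumptions made for illustration.

```python
import numpy as np

def toy_block(cls_tok, frames, w):
    """Hypothetical stand-in for a Conformer block (assumption):
    one linear map plus tanh; the CLS token also absorbs the frame mean."""
    frames = np.tanh(frames @ w)
    cls_tok = np.tanh((cls_tok + frames.mean(axis=0)) @ w)
    return cls_tok, frames

def pool_frames(frames, stride=2):
    """Average-pool along time so the next block sees a shorter sequence.
    The stride value is an assumption."""
    t = (frames.shape[0] // stride) * stride  # drop a trailing remainder
    return frames[:t].reshape(-1, stride, frames.shape[1]).mean(axis=1)

def hm_forward(frames, n_blocks=3, seed=0):
    """Run blocks with pooling in between and collect one CLS token per level."""
    rng = np.random.default_rng(seed)
    dim = frames.shape[1]
    cls_tok = np.zeros(dim)
    collected, lengths = [], []
    for _ in range(n_blocks):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        cls_tok, frames = toy_block(cls_tok, frames, w)
        collected.append(cls_tok)      # classification token from this level
        lengths.append(frames.shape[0])
        frames = pool_frames(frames)   # hierarchical pooling step
    # Multi-level aggregation, here by simple concatenation (assumption).
    return np.concatenate(collected), lengths

if __name__ == "__main__":
    x = np.random.default_rng(1).standard_normal((16, 8))  # 16 frames, 8 dims
    emb, lengths = hm_forward(x)
    print(emb.shape, lengths)
```

Each block processes a sequence half as long as the previous one, and the final embedding combines tokens gathered at every level, so a downstream classifier would see evidence from several temporal resolutions.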
Related papers
- Dual DETRs for Multi-Label Temporal Action Detection [46.05173000284639]
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos.
We propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level.
We evaluate DualDETR on three challenging multi-label TAD benchmarks.
arXiv Detail & Related papers (2024-03-31T11:43:39Z)
- Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD).
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z)
- Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture [2.9805017559176883]
This paper extends the existing Res2Net by involving the recent Conformer block to further exploit the local patterns on acoustic features.
Experimental results on ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture is able to improve the spoofing countermeasures performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
arXiv Detail & Related papers (2022-10-07T14:30:13Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is compromised to lots of conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.