TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents
- URL: http://arxiv.org/abs/2601.12895v1
- Date: Mon, 19 Jan 2026 09:50:51 GMT
- Title: TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents
- Authors: Chan Naseeb, Adeel Ashraf Cheema, Hassan Sami, Tayyab Afzal, Muhammad Omair, Usman Habib
- Abstract summary: TwoHead-SwinFPN is a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. Experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization.
- Score: 0.4881924950569192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31% accuracy, 90.78% AUC for classification, and 57.24% mean Dice score for localization. The proposed method achieves an F1-score of 88.61% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.
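The abstract names two quantities that are easy to make concrete: the Dice score used to evaluate localization, and the uncertainty-weighted combination of the detection and segmentation losses. The sketch below is illustrative only. The abstract does not give the exact loss formulation, so the weighting shown is the commonly used learned log-variance form, assumed here rather than taken from the paper. The function names `dice_score` and `uncertainty_weighted_loss` and their parameters are hypothetical.

```python
import math


def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between two flat binary masks (sequences of 0/1).

    Dice = 2 * |A intersect B| / (|A| + |B|); eps avoids division by zero
    when both masks are empty (the score is then 1.0).
    """
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return (2.0 * intersection + eps) / (total + eps)


def uncertainty_weighted_loss(cls_loss, seg_loss, log_var_cls, log_var_seg):
    """Combine two task losses using per-task log-variances.

    Assumed form (not confirmed by the abstract):
        total = exp(-s_cls) * L_cls + s_cls + exp(-s_seg) * L_seg + s_seg
    where s_i = log(sigma_i^2) would be a trainable scalar in a real model.
    A larger s_i down-weights task i, while the additive s_i term penalizes
    inflating the variance without bound.
    """
    weighted_cls = math.exp(-log_var_cls) * cls_loss + log_var_cls
    weighted_seg = math.exp(-log_var_seg) * seg_loss + log_var_seg
    return weighted_cls + weighted_seg


if __name__ == "__main__":
    # Toy masks: one of the two predicted pixels overlaps the ground truth.
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    # With both log-variances at zero, the total is the plain sum of losses.
    print(uncertainty_weighted_loss(1.0, 2.0, 0.0, 0.0))
```

In a training loop the two log-variance values would typically be learnable parameters optimized together with the network weights, so the balance between the classification head and the segmentation head is learned rather than hand-tuned.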
Related papers
- Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition [7.632962062462334]
Zero-shot Handwritten Chinese Character Recognition aims to recognize unseen characters by leveraging radical-based semantic compositions. We propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. Our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset.
arXiv Detail & Related papers (2026-02-03T16:08:40Z)
- AI Generated Text Detection [0.0]
This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. Results indicate that contextual modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization.
arXiv Detail & Related papers (2026-01-07T11:18:10Z)
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing [117.58619053719251]
MinerU2.5 is a document parsing model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition.
arXiv Detail & Related papers (2025-09-26T10:45:48Z)
- Deepfake Detection that Generalizes Across Benchmarks [48.85953407706351]
The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. This work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC.
arXiv Detail & Related papers (2025-08-08T12:03:56Z)
- AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification [51.525891360380285]
AHDMIL is an Asymmetric Hierarchical Distillation Multi-Instance Learning framework. It eliminates irrelevant patches through a two-step training process. It consistently outperforms previous state-of-the-art methods in both classification performance and inference speed.
arXiv Detail & Related papers (2025-08-07T07:47:16Z)
- G-MSGINet: A Grouped Multi-Scale Graph-Involution Network for Contactless Fingerprint Recognition [20.458766184257147]
G-MSGINet is a unified framework for robust contactless fingerprint recognition. It jointly performs minutiae localization and identity embedding directly from raw input images. Extensive experiments on three benchmark datasets show G-MSGINet consistently achieves minutiae F1-scores in the range of $0.83 \pm 0.02$ and Rank-1 identification accuracies between 97.0% and 99.1%.
arXiv Detail & Related papers (2025-05-13T05:24:24Z)
- AWARE-NET: Adaptive Weighted Averaging for Robust Ensemble Network in Deepfake Detection [0.0]
We propose a novel two-tier ensemble framework for deepfake detection based on deep learning. Our framework employs a unique approach where each architecture is instantiated three times. Experiments achieved state-of-the-art intra-dataset performance.
arXiv Detail & Related papers (2025-05-01T05:14:50Z)
- Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection [6.367999777464464]
Multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting.
In this paper, we introduce the Straight-through Gumbel-Softmax framework, offering a comprehensive approach to search multimodal fusion model architectures.
Experiments on the FakeAVCeleb and SWAN-DF datasets demonstrated an AUC of 94.4%, achieved with minimal model parameters.
arXiv Detail & Related papers (2024-06-19T09:26:22Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- A^2-FPN: Attention Aggregation based Feature Pyramid Network for Instance Segmentation [68.10621089649486]
We propose the Attention Aggregation based Feature Pyramid Network (A2-FPN) to improve multi-scale feature learning.
A2-FPN achieves an improvement of 2.0% and 1.4% mask AP when integrated into strong baselines such as Cascade Mask R-CNN and Hybrid Task Cascade.
arXiv Detail & Related papers (2021-05-07T11:51:08Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantically rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.