LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention
- URL: http://arxiv.org/abs/2505.07734v1
- Date: Mon, 12 May 2025 16:42:19 GMT
- Title: LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention
- Authors: Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen,
- Abstract summary: We introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection.<n>LAMM-ViT integrates Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer.<n>In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC and 98.62% mean AP.
- Score: 4.0810988694972385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
Related papers
- Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection [67.84730634802204]
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management.<n>Most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions.<n>We observe that frequency-domain feature modeling particularly in the wavelet domain amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain.
arXiv Detail & Related papers (2025-08-07T11:14:16Z) - NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning [1.7603474309877931]
NexViTAD is a cross-domain anomaly detection framework based on vision foundation models.<n>It addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms.<n>It delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains.
arXiv Detail & Related papers (2025-07-10T09:29:26Z) - Generalizable Multispectral Land Cover Classification via Frequency-Aware Mixture of Low-Rank Token Experts [22.75047167955269]
We introduce Land-MoE, a novel approach for multispectral land cover classification (MLCC)<n>Land-MoE comprises two key modules: the mixture of low-rank token experts (MoLTE) and frequency-aware filters (FAF)
arXiv Detail & Related papers (2025-05-20T08:52:28Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.<n>We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.<n>FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection [4.908389661988192]
HFMF is a comprehensive two-stage deepfake detection framework.<n>It integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism.<n>We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks.
arXiv Detail & Related papers (2025-01-10T00:20:29Z) - Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation [37.79819260918366]
Continual Test-Time Adaptation (CTTA) aims to adapt the pre-trained model to ever-evolving target domains.
We explore the integration of a Mixture-of-Activation-Sparsity-Experts (MoASE) as an adapter for the CTTA task.
arXiv Detail & Related papers (2024-05-26T08:51:39Z) - Fiducial Focus Augmentation for Facial Landmark Detection [4.433764381081446]
We propose a novel image augmentation technique to enhance the model's understanding of facial structures.
We employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss.
Our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
arXiv Detail & Related papers (2024-02-23T01:34:00Z) - GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning [50.7702397913573]
The rapid advancement of photorealistic generators has reached a critical juncture where the discrepancy between authentic and manipulated images is increasingly indistinguishable.
Although there have been a number of publicly available face forgery datasets, the forgery faces are mostly generated using GAN-based synthesis technology.
We propose a large-scale, diverse, and fine-grained high-fidelity dataset, namely GenFace, to facilitate the advancement of deepfake detection.
arXiv Detail & Related papers (2024-02-03T03:13:50Z) - Demystify Transformers & Convolutions in Modern Image Deep Networks [80.16624587948368]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.<n>We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.<n>Various STMs are integrated into this unified framework for comprehensive comparative analysis.
arXiv Detail & Related papers (2022-11-10T18:59:43Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - Calibrated Hyperspectral Image Reconstruction via Graph-based
Self-Tuning Network [40.71031760929464]
Hyperspectral imaging (HSI) has attracted increasing research attention, especially for the ones based on a coded snapshot spectral imaging (CASSI) system.
Existing deep HSI reconstruction models are generally trained on paired data to retrieve original signals upon 2D compressed measurements given by a particular optical hardware mask in CASSI.
This mask-specific training style will lead to a hardware miscalibration issue, which sets up barriers to deploying deep HSI models among different hardware and noisy environments.
We propose a novel Graph-based Self-Tuning ( GST) network to reason uncertainties adapting to varying spatial structures of masks among
arXiv Detail & Related papers (2021-12-31T09:39:13Z) - Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image
Reconstruction [127.20208645280438]
Hyperspectral image (HSI) reconstruction aims to recover the 3D spatial-spectral signal from a 2D measurement.
Modeling the inter-spectra interactions is beneficial for HSI reconstruction.
Mask-guided Spectral-wise Transformer (MST) proposes a novel framework for HSI reconstruction.
arXiv Detail & Related papers (2021-11-15T16:59:48Z) - Semantic Change Detection with Asymmetric Siamese Networks [71.28665116793138]
Given two aerial images, semantic change detection aims to locate the land-cover variations and identify their change types with pixel-wise boundaries.
This problem is vital in many earth vision related tasks, such as precise urban planning and natural resource management.
We present an asymmetric siamese network (ASN) to locate and identify semantic changes through feature pairs obtained from modules of widely different structures.
arXiv Detail & Related papers (2020-10-12T13:26:30Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.