ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training
- URL: http://arxiv.org/abs/2410.04032v1
- Date: Sat, 5 Oct 2024 04:41:55 GMT
- Title: ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training
- Authors: Weihuang Liu, Xi Shen, Chi-Man Pun, Xiaodong Cun,
- Abstract summary: Social media is increasingly plagued by realistic fake images, making it hard to trust content.
Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets.
We introduce ForgeryTTT, the first method leveraging test-time training to identify manipulated regions in images.
- Score: 42.58645429356456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during the training-time training using a large synthetic dataset. Precisely, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used for the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
Related papers
- Towards Generic Image Manipulation Detection with Weakly-Supervised
Self-Consistency Learning [49.43362803584032]
We propose weakly-supervised image manipulation detection.
Such a setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques.
Two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC)
arXiv Detail & Related papers (2023-09-03T19:19:56Z) - Improving Adversarial Robustness of Masked Autoencoders via Test-time
Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE)
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z) - ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical
Handwritten Documents [3.9688530261646653]
Keywords spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections.
We propose ST-KeyS, a masked auto-encoder model based on vision transformers where the pretraining stage is based on the mask-and-predict paradigm.
In the fine-tuning stage, the pre-trained encoder is integrated into a siamese neural network model that is fine-tuned to improve feature embedding from the input images.
arXiv Detail & Related papers (2023-03-06T13:39:41Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complimentary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z) - Semantic-aware Dense Representation Learning for Remote Sensing Image
Change Detection [20.761672725633936]
Training deep learning-based change detection model heavily depends on labeled data.
Recent trend is using remote sensing (RS) data to obtain in-domain representations via supervised or self-supervised learning (SSL)
We propose dense semantic-aware pre-training for RS image CD via sampling multiple class-balanced points.
arXiv Detail & Related papers (2022-05-27T06:08:33Z) - MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP)
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.