PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation
Models Through Prompt Tuning
- URL: http://arxiv.org/abs/2401.00653v1
- Date: Mon, 1 Jan 2024 03:45:07 GMT
- Title: PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation
Models Through Prompt Tuning
- Authors: Xuntao Liu, Yuzhou Yang, Qichao Ying, Zhenxing Qian, Xinpeng Zhang and
Sheng Li
- Abstract summary: We present a novel Prompt-IML framework for detecting tampered images.
Humans tend to discern the authenticity of an image based on semantic and high-frequency information.
Our model can achieve better performance on eight typical fake image datasets.
- Score: 35.39822183728463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deceptive images can be shared within seconds on social networking
services, posing substantial risks. Tampering traces, such as boundary artifacts
and high-frequency information, have received significant emphasis from large
networks in the Image Manipulation Localization (IML) field. However, such traces
are easily weakened by image post-processing operations, which limits the
generalization and robustness of existing methods. We present a novel Prompt-IML
framework. Inspired by the observation that humans tend to discern the
authenticity of an image from both semantic and high-frequency information, the
proposed framework leverages rich semantic knowledge from pre-trained visual
foundation models to assist IML. We are the first to design a framework that
utilizes visual foundation models specifically for the IML task. Moreover, we
design a Feature Alignment and Fusion module that aligns and fuses semantic
features with high-frequency features, aiming to locate tampered regions from
multiple perspectives. Experimental results demonstrate that our model achieves
better performance on eight typical fake image datasets and exhibits outstanding
robustness.
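
To make the described pipeline more concrete, below is a minimal PyTorch sketch of the idea: a frozen ViT-style backbone adapted with a few learnable prompt tokens (prompt tuning), a fixed high-pass filter branch that exposes high-frequency traces, and a small head that aligns and fuses the two feature streams into a tampering mask. All module names, dimensions, the Laplacian filter, and the concatenation-based fusion are illustrative assumptions rather than the authors' implementation; in practice the semantic branch would load pre-trained foundation-model weights instead of random ones.

```python
# Minimal sketch of the Prompt-IML idea (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptedSemanticBranch(nn.Module):
    """Frozen transformer encoder adapted with learnable prompt tokens."""

    def __init__(self, dim=256, depth=4, num_prompts=8, patch=16):
        super().__init__()
        # Stand-in for a pre-trained visual foundation model; load real weights in practice.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        for p in self.parameters():          # freeze the backbone
            p.requires_grad = False
        # Only these prompt tokens (and the heads below) are trained.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)          # B, N, C
        tokens = torch.cat([self.prompts.expand(b, -1, -1), tokens], dim=1)
        out = self.encoder(tokens)[:, self.prompts.size(1):]             # drop prompts
        h = w = int(out.size(1) ** 0.5)
        return out.transpose(1, 2).reshape(b, -1, h, w)                  # B, C, h, w


class HighFreqBranch(nn.Module):
    """Fixed Laplacian high-pass filter followed by a small trainable CNN."""

    def __init__(self, dim=256):
        super().__init__()
        k = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.register_buffer("hp", k.view(1, 1, 3, 3).repeat(3, 1, 1, 1))
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=4, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        hf = F.conv2d(x, self.hp, padding=1, groups=3)   # high-frequency residual
        return self.cnn(hf)


class PromptIMLSketch(nn.Module):
    """Align the two streams spatially, fuse them, and predict a tampering mask."""

    def __init__(self, dim=256):
        super().__init__()
        self.semantic = PromptedSemanticBranch(dim)
        self.highfreq = HighFreqBranch(dim)
        self.fuse = nn.Sequential(nn.Conv2d(2 * dim, dim, 3, padding=1),
                                  nn.ReLU(), nn.Conv2d(dim, 1, 1))

    def forward(self, x):
        sem = self.semantic(x)
        hf = F.interpolate(self.highfreq(x), size=sem.shape[-2:],
                           mode="bilinear", align_corners=False)   # alignment step
        mask = self.fuse(torch.cat([sem, hf], dim=1))               # fusion step
        return F.interpolate(mask, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = PromptIMLSketch()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1, 224, 224]) -- per-pixel tampering logits
```

Under this sketch, only the prompt tokens, the high-frequency CNN, and the fusion head receive gradients, which is what keeps the adaptation cost low relative to fine-tuning the whole foundation model.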
Related papers
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns from the image modality, other modalities such as fine-grained noise and text are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
- Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening [2.874893537471256]
Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches.
In this paper, we propose a model-based deep unfolded method for satellite image fusion.
Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2024-09-04T13:05:00Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
- Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images.
A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with high-level targets.
This study is among the first to thoroughly analyze and address the challenges of such a framework, which we refer to as Latent MIM.
arXiv Detail & Related papers (2024-07-22T17:54:41Z)
- VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users.
We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image.
In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to further enhance the interaction between specific spatial regions of the image and the corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With this simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)