Related papers: PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning

PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning

URL: http://arxiv.org/abs/2401.00653v1
Date: Mon, 1 Jan 2024 03:45:07 GMT
Title: PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning
Authors: Xuntao Liu, Yuzhou Yang, Qichao Ying, Zhenxing Qian, Xinpeng Zhang and Sheng Li
Abstract summary: We present a novel Prompt-IML framework for detecting tampered images. Humans tend to discern authenticity of an image based on semantic and high-frequency information. Our model can achieve better performance on eight typical fake image datasets.
Score: 35.39822183728463
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deceptive images can be shared in seconds with social networking services, posing substantial risks. Tampering traces, such as boundary artifacts and high-frequency information, have been significantly emphasized by massive networks in the Image Manipulation Localization (IML) field. However, they are prone to image post-processing operations, which limit the generalization and robustness of existing methods. We present a novel Prompt-IML framework. We observe that humans tend to discern the authenticity of an image based on both semantic and high-frequency information, inspired by which, the proposed framework leverages rich semantic knowledge from pre-trained visual foundation models to assist IML. We are the first to design a framework that utilizes visual foundation models specially for the IML task. Moreover, we design a Feature Alignment and Fusion module to align and fuse features of semantic features with high-frequency features, which aims at locating tampered regions from multiple perspectives. Experimental results demonstrate that our model can achieve better performance on eight typical fake image datasets and outstanding robustness.

Related papers

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I)<n>We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs.<n> Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models [37.63573703440172]
Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks.<n>We propose Consistency-guided Multi-view Collaborative Optimization (CoMuCo), a novel fine-tuning strategy for VLMs.
arXiv Detail & Related papers (2025-08-18T12:00:09Z)
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models [50.98559225639266]
Sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Global Semantic-guided Weight Allocator (GSWA) module allocates weights to sub-images based on their relative information density. SleighVL, a lightweight yet high-performing model, outperforms models with comparable parameters and remains competitive with larger models.
arXiv Detail & Related papers (2025-01-24T06:42:06Z)
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening [2.874893537471256]
Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches. In this paper, we propose a model-based deep unfolded method for satellite image fusion. Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2024-09-04T13:05:00Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise. MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains. Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images. A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with the high-level targets. This study is among the first to thoroughly analyze and address the challenges of such framework, which we refer to as Latent MIM.
arXiv Detail & Related papers (2024-07-22T17:54:41Z)
VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users. We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to enhance further the the interaction between specific space regions of the image and corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z)
Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models. In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks. After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring dependencies of object-to-object, object-to-patch and patch-to-patch.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information. We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.