Perceptual MAE for Image Manipulation Localization: A High-level Vision
Learner Focusing on Low-level Features
- URL: http://arxiv.org/abs/2310.06525v1
- Date: Tue, 10 Oct 2023 11:14:29 GMT
- Title: Perceptual MAE for Image Manipulation Localization: A High-level Vision
Learner Focusing on Low-level Features
- Authors: Xiaochen Ma, Jizhe Zhou, Xiong Xu, Zhuohang Jiang, Chi-Man Pun
- Abstract summary: We propose a method to enhance the Masked Autoencoder (MAE) by incorporating high-resolution inputs and a perceptual loss supervision module.
- Score: 33.37376410890546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, multimedia forensics faces unprecedented challenges due to the
rapid advancement of multimedia generation technology, thereby making Image
Manipulation Localization (IML) crucial in the pursuit of truth. The key to IML
lies in revealing the artifacts or inconsistencies between the tampered and
authentic areas, which are evident under pixel-level features. Consequently,
existing studies treat IML as a low-level vision task, focusing on predicting
tampered masks by crafting pixel-level features such as image RGB noise, edge
signals, or high-frequency features. However, in practice, tampering commonly
occurs at the object level, and different classes of objects have varying
likelihoods of becoming targets of tampering. Therefore, object semantics are
also vital in identifying the tampered areas in addition to pixel-level
features. This necessitates IML models to carry out a semantic understanding of
the entire image. In this paper, we reformulate the IML task as a high-level
vision task that greatly benefits from low-level features. Based on such an
interpretation, we propose a method to enhance the Masked Autoencoder (MAE) by
incorporating high-resolution inputs and a perceptual loss supervision module,
which we term Perceptual MAE (PMAE). While MAE has demonstrated an impressive
understanding of object semantics, PMAE additionally compensates for the
low-level semantics that MAE lacks through our proposed enhancements. As
evidenced by extensive experiments,
this paradigm effectively unites the low-level and high-level features of the
IML task and outperforms state-of-the-art tampering localization methods on all
five publicly available datasets.
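As a concrete illustration of the abstract's core idea, the sketch below pairs an MAE-style pixel reconstruction loss with a perceptual term computed in a frozen VGG-16 feature space. The backbone choice, layer cutoff, L1 feature criterion, loss weight, and all helper names are illustrative assumptions for this sketch, not the authors' exact configuration, and the loss is shown on the full reconstructed image rather than only the masked patches.
```python
# Minimal sketch (PyTorch) of perceptual-loss supervision on top of MAE
# reconstruction. VGG-16 backbone, layer cutoff, L1 feature criterion, and
# the weight `lam` are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class PerceptualLoss(nn.Module):
    """Compares a reconstruction with its target in a frozen VGG-16 feature space."""

    def __init__(self, layer_idx: int = 16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:layer_idx]
        for p in vgg.parameters():
            p.requires_grad = False  # supervision signal only; never fine-tuned
        self.extractor = vgg.eval()

    def forward(self, recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # L1 distance between deep features of reconstructed and original images
        return F.l1_loss(self.extractor(recon), self.extractor(target))


def pmae_style_loss(recon: torch.Tensor, target: torch.Tensor,
                    perceptual: PerceptualLoss, lam: float = 0.1) -> torch.Tensor:
    """Pixel reconstruction loss (as in MAE) plus a weighted perceptual term."""
    pixel_loss = F.mse_loss(recon, target)  # standard MAE reconstruction objective
    return pixel_loss + lam * perceptual(recon, target)


# Hypothetical usage: `recon` and `target` are both (B, 3, H, W) image tensors.
# loss = pmae_style_loss(recon, target, PerceptualLoss())
```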
Related papers
- PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling [7.630967411418269]
We propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths.
Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features.
arXiv Detail & Related papers (2025-01-06T13:30:16Z)
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.12958154544838]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images.
A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with high-level targets.
This study is among the first to thoroughly analyze and address the challenges of such a framework, which we refer to as Latent MIM.
arXiv Detail & Related papers (2024-07-22T17:54:41Z)
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok).
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images.
Humans tend to discern the authenticity of an image based on semantic and high-frequency information.
Our model can achieve better performance on eight typical fake image datasets.
arXiv Detail & Related papers (2024-01-01T03:45:07Z)
- Towards Granularity-adjusted Pixel-level Semantic Annotation [26.91350707156658]
GranSAM provides semantic segmentation at the user-defined granularity level on unlabeled data without the need for any manual supervision.
We accumulate semantic information from synthetic images generated by the Stable Diffusion model or from web-crawled images.
We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed mIoU increases of +17.95% and +5.17%, respectively.
arXiv Detail & Related papers (2023-12-05T01:37:18Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.