IML-ViT: Benchmarking Image Manipulation Localization by Vision
Transformer
- URL: http://arxiv.org/abs/2307.14863v3
- Date: Thu, 31 Aug 2023 13:25:59 GMT
- Title: IML-ViT: Benchmarking Image Manipulation Localization by Vision
Transformer
- Authors: Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y. Al Hammadi, Jizhe Zhou
- Abstract summary: Advanced image tampering techniques are challenging the trustworthiness of multimedia.
What makes a good IML model? The answer lies in the way to capture artifacts.
We term this simple but effective ViT paradigm IML-ViT, which has significant potential to become a new benchmark for IML.
- Score: 26.93638840931684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced image tampering techniques are increasingly challenging the
trustworthiness of multimedia, leading to the development of Image Manipulation
Localization (IML). But what makes a good IML model? The answer lies in the way
to capture artifacts. Exploiting artifacts requires the model to extract
non-semantic discrepancies between manipulated and authentic regions,
necessitating explicit comparisons between the two areas. With the
self-attention mechanism, naturally, the Transformer should be a better
candidate to capture artifacts. However, due to limited datasets, there is
currently no pure ViT-based approach for IML to serve as a benchmark, and CNNs
dominate the entire task. Nevertheless, CNNs suffer from weak long-range and
non-semantic modeling. To bridge this gap, based on the fact that artifacts are
sensitive to image resolution, amplified under multi-scale features, and
massive at the manipulation border, we formulate the answer to the former
question as building a ViT with high-resolution capacity, multi-scale feature
extraction capability, and manipulation edge supervision that could converge
with a small amount of data. We term this simple but effective ViT paradigm
IML-ViT, which has significant potential to become a new benchmark for IML.
Extensive experiments on five benchmark datasets verified that our model outperforms
state-of-the-art manipulation localization methods. Code and models are
available at https://github.com/SunnyHaze/IML-ViT.
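The abstract's recipe (high-resolution input, multi-scale features, and manipulation-edge supervision) can be illustrated with a small training-loss sketch. The snippet below is a minimal, hypothetical PyTorch illustration of the edge-supervision idea only: the `edge_mask` morphological-gradient border extraction, the loss weighting, and all function names are assumptions for illustration, not the paper's exact design (see the official repository for the real implementation).

```python
# Minimal sketch (PyTorch) of pairing a pixel-level manipulation-mask loss
# with auxiliary supervision concentrated on the manipulation border.
# Hypothetical illustration; not the IML-ViT reference implementation.
import torch
import torch.nn.functional as F

def edge_mask(gt_mask: torch.Tensor, kernel: int = 7) -> torch.Tensor:
    """Approximate the manipulation border as a morphological gradient:
    dilation(mask) - erosion(mask), both computed with max-pooling."""
    pad = kernel // 2
    dilated = F.max_pool2d(gt_mask, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-gt_mask, kernel, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)

def iml_loss(pred_logits: torch.Tensor, gt_mask: torch.Tensor,
             edge_weight: float = 1.0) -> torch.Tensor:
    """Binary segmentation loss plus an extra loss restricted to the border,
    where manipulation artifacts are said to be most abundant."""
    seg = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
    border = edge_mask(gt_mask)
    edge = F.binary_cross_entropy_with_logits(
        pred_logits, gt_mask, weight=border, reduction="sum")
    edge = edge / border.sum().clamp(min=1.0)  # mean over border pixels only
    return seg + edge_weight * edge

# Usage: pred_logits and gt_mask are (B, 1, H, W) tensors at the padded
# high resolution; a multi-scale decoder over the ViT feature pyramid
# would normally produce pred_logits.
pred_logits = torch.randn(2, 1, 256, 256)
gt_mask = (torch.rand(2, 1, 256, 256) > 0.9).float()
print(iml_loss(pred_logits, gt_mask))
```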
Related papers
- Your ViT is Secretly an Image Segmentation Model [50.71238842539735]
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks.
We show that inductive biases introduced by task-specific components can instead be learned by the ViT itself.
We introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation.
arXiv Detail & Related papers (2025-03-24T19:56:02Z) - A Noise and Edge extraction-based dual-branch method for Shallowfake and Deepfake Localization [15.647035299476894]
We develop a dual-branch model that integrates manually designed feature noise with conventional CNN features.
In comparisons, the model easily outperforms existing state-of-the-art (SoTA) models.
arXiv Detail & Related papers (2024-09-02T02:18:34Z) - Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector [15.647035299476894]
This publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer.
The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation.
It specifically focuses on improving the global texture module, which extracts feature map correlations (see the sketch after this list).
arXiv Detail & Related papers (2024-08-29T20:26:27Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on the ICDAR2015 video and DSText v2 benchmarks.
arXiv Detail & Related papers (2024-04-30T15:49:03Z) - DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z) - RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in
Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z) - Masked autoencoders are effective solution to transformer data-hungry [0.0]
Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) on several vision tasks thanks to their global modeling capability.
However, ViT lacks the inductive bias inherent to convolution, so it requires a large amount of data for training.
Masked autoencoders (MAE) can make the transformer focus more on the image itself.
arXiv Detail & Related papers (2022-12-12T03:15:19Z) - Diverse Instance Discovery: Vision-Transformer for Instance-Aware
Multi-Label Image Recognition [24.406654146411682]
This paper builds on the Vision Transformer (ViT).
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z) - Multimodal Fusion Transformer for Remote Sensing Image Classification [35.57881383390397]
Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs).
To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters.
We introduce a new multimodal fusion transformer (MFT) network which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification.
arXiv Detail & Related papers (2022-03-31T11:18:41Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic Inductive Bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
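As referenced in the Tex-ViT entry above, "feature map correlation" is conventionally computed as a Gram matrix over channel activations. The sketch below shows only that generic operation, under the assumption that Tex-ViT's global texture module follows this standard recipe; its actual normalization and placement may differ.

```python
# Minimal sketch of channel-wise feature correlation (a Gram matrix),
# the usual texture statistic behind "feature map correlation".
# Generic illustration only; not Tex-ViT's exact module.
import torch

def gram_correlation(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) CNN feature map -> (B, C, C) channel correlations."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)              # each channel as a vector
    gram = torch.bmm(flat, flat.transpose(1, 2))  # inner products between channels
    return gram / (h * w)                         # normalize by spatial size

# Usage: correlations taken from an intermediate ResNet stage could be fed
# to a parallel transformer branch as texture features.
feat = torch.randn(2, 64, 56, 56)
print(gram_correlation(feat).shape)  # torch.Size([2, 64, 64])
```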
This list is automatically generated from the titles and abstracts of the papers on this site.