Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
- URL: http://arxiv.org/abs/2508.20987v1
- Date: Thu, 28 Aug 2025 16:44:40 GMT
- Title: Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
- Authors: Chenfan Qu, Yiwu Zhong, Bin Li, Lianwen Jin
- Abstract summary: Images manipulated using image editing tools can mislead viewers and pose significant risks to societal security. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations. We construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations.
- Score: 49.83611963142304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to societal security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm, CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets such as IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses the previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
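The abstract names Object Jitter without detailing it. Below is a minimal sketch of one plausible reading, jitter-based copy-move synthesis that yields a forged image together with its pixel-level forgery mask; the function `object_jitter` and all parameters are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np
from PIL import Image

def object_jitter(image: np.ndarray, object_mask: np.ndarray,
                  max_shift: int = 8) -> tuple[np.ndarray, np.ndarray]:
    """Synthesize a manipulated image by re-pasting an object at a
    slightly jittered location; returns the forged image and the
    pixel-level mask of the pasted region (hypothetical sketch).

    image: HxWx3 uint8 array; object_mask: HxW bool array of one object.
    """
    h, w = object_mask.shape
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    forged = image.copy()
    target_mask = np.zeros_like(object_mask)

    ys, xs = np.nonzero(object_mask)
    ty = np.clip(ys + dy, 0, h - 1)
    tx = np.clip(xs + dx, 0, w - 1)
    forged[ty, tx] = image[ys, xs]   # paste the jittered object copy
    target_mask[ty, tx] = True       # pixel-level annotation for free

    return forged, target_mask

# Usage: a synthetic blob stands in for a real segmentation mask.
img = np.asarray(Image.new("RGB", (256, 256), (120, 160, 90)))
mask = np.zeros((256, 256), dtype=bool)
mask[100:140, 80:130] = True
forged, gt = object_jitter(img, mask)
```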
Related papers
- Data Factory with Minimal Human Effort Using VLMs [35.30747487237989]
We introduce a training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotations and significantly improves downstream tasks. Our results on PASCAL-5i and COCO-20i show promising performance and outperform concurrent work on one-shot semantic segmentation (a minimal generation sketch follows this entry).
arXiv Detail & Related papers (2025-10-07T09:43:24Z)
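For the Data Factory entry above, a minimal sketch of ControlNet-conditioned generation via the Hugging Face diffusers API is shown below. Treating the conditioning map as the pixel-level label, the checkpoint names, and the file `layout.png` are assumptions for illustration, not the paper's actual pipeline.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Segmentation-conditioned ControlNet: the conditioning map doubles as
# the pixel-level label for the generated image (our assumption).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

seg_map = Image.open("layout.png")  # hypothetical color-coded class layout
image = pipe("a living room, photorealistic",
             image=seg_map, num_inference_steps=30).images[0]
# (image, seg_map) is one synthetic training pair with pixel labels.
```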
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model [92.61216319417208]
We propose a novel diffusion model (DM)-based framework for image deblurring. The DM generates prior knowledge that aids in recovering the textures of blurry images. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer).
arXiv Detail & Related papers (2025-07-18T01:50:31Z)
- A Recipe for Improving Remote Sensing VLM Zero Shot Generalization [0.4427533728730559]
We present two novel image-caption datasets for training remote sensing foundation models. The first pairs aerial and satellite imagery with captions generated by Gemini from landmarks extracted from Google Maps. The second uses public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z)
- EliGen: Entity-Level Controlled Image Generation with Regional Attention [7.7120747804211405]
We present EliGen, a novel framework for entity-level controlled image generation. We train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. We also propose an inpainting fusion pipeline that extends its capabilities to multi-entity image inpainting tasks.
arXiv Detail & Related papers (2025-01-02T06:46:13Z)
- Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation [53.95204595640208]
Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data.
Previous approaches have generated synthetic images at high resolutions without leveraging information from real images.
Our method, MUSE, generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features (a CAM sketch follows this entry).
arXiv Detail & Related papers (2024-11-26T02:23:31Z)
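MUSE's exact procedure is not given in this listing, but the CAM computation it builds on is standard: weight the final convolutional feature maps by the classifier weights of a target class, then upsample. A minimal sketch with torchvision follows; the backbone choice and input are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

# Capture the final conv feature maps with a forward hook.
feats = {}
model.layer4.register_forward_hook(
    lambda m, i, o: feats.__setitem__("maps", o))

x = torch.randn(1, 3, 224, 224)   # stand-in for a real image tensor
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(1).item()

maps = feats["maps"]              # (1, 512, 7, 7)
w = model.fc.weight[cls]          # (512,) classifier weights for cls
cam = torch.einsum("c,bchw->bhw", w, maps)
cam = F.interpolate(cam[None], size=(224, 224),
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)  # normalize
```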
- Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach [13.262064234892282]
Small object generation has been limited due to difficulties in aligning cross-modal attention maps between text and these objects.
Our approach offers a training-free method that significantly mitigates this alignment issue with local and global attention guidance.
Preliminary results demonstrate the effectiveness of our method, showing marked improvements in the fidelity and accuracy of small object generation compared to existing models.
arXiv Detail & Related papers (2024-11-03T12:38:23Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers that aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention [14.055584700641212]
Existing image inpainting methods leverage convolution-based downsampling approaches to reduce spatial dimensions.
We propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, built around a novel mask-aware pixel-shuffle downsampling module (sketched after this entry).
We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets.
arXiv Detail & Related papers (2024-02-22T00:14:26Z)
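The HINT entry mentions mask-aware pixel-shuffle downsampling. A minimal sketch of the underlying operation, using PyTorch's standard PixelUnshuffle, is below; applying the same rearrangement to the inpainting mask is our assumption of the general idea, not HINT's exact module.

```python
import torch
import torch.nn as nn

# Pixel-unshuffle trades spatial resolution for channels losslessly:
# (B, C, H, W) -> (B, C*r*r, H/r, W/r). Downsampling the mask the same
# way keeps per-position validity information alongside the features.
unshuffle = nn.PixelUnshuffle(downscale_factor=2)

image = torch.randn(1, 3, 256, 256)
mask = torch.ones(1, 1, 256, 256)        # 1 = known pixel, 0 = hole
mask[:, :, 96:160, 96:160] = 0

image_ds = unshuffle(image)              # (1, 12, 128, 128)
mask_ds = unshuffle(mask)                # (1, 4, 128, 128)
tokens = torch.cat([image_ds, mask_ds], dim=1)  # mask-aware features
```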
- IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer [25.673986942179123]
Advanced image tampering techniques are challenging the trustworthiness of multimedia. What makes a good IML model? The answer lies in how it captures artifacts. We build a ViT paradigm, IML-ViT, with high-resolution capacity, multi-scale feature extraction, and manipulation edge supervision (an edge-supervision sketch follows this entry). This simple but effective paradigm has significant potential to become a new benchmark for IML.
arXiv Detail & Related papers (2023-07-27T13:49:27Z)
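Manipulation edge supervision is named but not defined in this listing. One common realization, extracting a boundary map from the ground-truth mask via a morphological gradient and adding a BCE term on those pixels, is sketched below as an assumption, not IML-ViT's exact loss.

```python
import torch
import torch.nn.functional as F

def edge_map(mask: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Boundary of a binary mask (B,1,H,W) via morphological gradient:
    dilation minus erosion, both implemented with max-pooling."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)

def iml_loss(pred_logits, gt_mask, edge_weight=0.5):
    """Region BCE plus an extra BCE restricted to boundary pixels
    (hypothetical weighting, not the paper's)."""
    region = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
    edges = edge_map(gt_mask)
    per_pix = F.binary_cross_entropy_with_logits(
        pred_logits, gt_mask, reduction="none")
    edge = (per_pix * edges).sum() / (edges.sum() + 1e-6)
    return region + edge_weight * edge

# Usage with random stand-ins for a prediction and its ground truth.
pred = torch.randn(2, 1, 64, 64)
gt = (torch.rand(2, 1, 64, 64) > 0.7).float()
loss = iml_loss(pred, gt)
```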
- High-Quality Entity Segmentation [110.55724145851725]
CropFormer is designed to tackle the intractability of instance-level segmentation on high-resolution images.
It improves mask prediction by fusing high-resolution image crops, which provide finer-grained image details, with the full image.
With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging entity segmentation task.
arXiv Detail & Related papers (2022-11-10T18:58:22Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations throughout the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)