Related papers: GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

URL: http://arxiv.org/abs/2406.16531v1
Date: Mon, 24 Jun 2024 11:10:41 GMT
Title: GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
Authors: Yirui Chen, Xudong Huang, Quan Zhang, Wei Li, Mingjian Zhu, Qiangyu Yan, Simiao Li, Hanting Chen, Hailin Hu, Jie Yang, Wei Liu, Jie Hu,
Abstract summary: Local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT and generative models. The GIM dataset has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. We propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module.
Score: 21.846935203845728
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and location(IMDL). However, the lack of a large-scale data foundation makes IMDL task unattainable. In this paper, a local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT and generative models. Upon this basis, We propose the GIM dataset, which has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. 2) Rich Image Content, encompassing a broad range of image classes 3) Diverse Generative Manipulation, manipulated images with state-of-the-art generators and various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module. Extensive experiments on the GIM demonstrate that GIMFormer surpasses previous state-of-the-art works significantly on two different benchmarks.

Related papers

LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer [32.9330637921386]
LAMIC is a Layout-Aware Multi-Image Composition framework.<n>It extends single-reference diffusion models to multi-reference scenarios in a training-free manner.<n>It consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings.
arXiv Detail & Related papers (2025-08-01T09:51:54Z)
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing [12.491684385808902]
MMO-IG is designed to generate RS images with supervised object labels from global and local aspects simultaneously. Considering the complex interdependencies among MMOs, we construct a spatial-cross dependency knowledge graph. Our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels.
arXiv Detail & Related papers (2024-12-18T10:19:12Z)
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models [103.25208095165486]
Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. We present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy.
arXiv Detail & Related papers (2024-12-09T21:44:02Z)
Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications [3.7636375810345744]
Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. We describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain.
arXiv Detail & Related papers (2024-10-29T11:03:31Z)
Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework,LLM-Morph, for aligning the deep features from different modal medical images. Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights. Third, for the alignment of tokens, we utilize other four adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective. Second, we introduce Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z)
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs) In particular, we study the importance of various architecture components and data choices. We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching. We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z)
PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images. Humans tend to discern authenticity of an image based on semantic and high-frequency information. Our model can achieve better performance on eight typical fake image datasets.
arXiv Detail & Related papers (2024-01-01T03:45:07Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context. We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.