AutoSplice: A Text-prompt Manipulated Image Dataset for Media Forensics
- URL: http://arxiv.org/abs/2304.06870v1
- Date: Fri, 14 Apr 2023 00:14:08 GMT
- Title: AutoSplice: A Text-prompt Manipulated Image Dataset for Media Forensics
- Authors: Shan Jia, Mingzhen Huang, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu
- Abstract summary: This paper aims to investigate the level of challenge that language-image generation models pose to media forensics.
We propose a new approach that leverages the DALL-E2 language-image model to automatically generate and splice masked regions guided by a text prompt.
This approach has resulted in the creation of a new image dataset called AutoSplice, containing 5,894 manipulated and authentic images.
- Score: 31.714342131823987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in language-image models have made it possible
to generate highly realistic images from textual descriptions.
However, the increased visual quality of these generated images poses a
potential threat to the field of media forensics. This paper aims to
investigate the level of challenge that language-image generation models pose
to media forensics. To achieve this, we propose a new approach that leverages
the DALL-E2 language-image model to automatically generate and splice masked
regions guided by a text prompt. To ensure realistic manipulations, we
designed an annotation platform with human review to verify that the text
prompts are plausible. This approach has resulted in the creation of a
new image dataset called AutoSplice, containing 5,894 manipulated and authentic
images. Specifically, we have generated a total of 3,621 images by locally or
globally manipulating real-world image-caption pairs, which we believe will
provide a valuable resource for developing generalized detection methods in
this area. The dataset is evaluated under two media forensic tasks: forgery
detection and localization. Our extensive experiments show that most media
forensics models struggle to detect AutoSplice images when treated as an
unseen manipulation type. However, once these models are fine-tuned on the
dataset, their performance improves on both tasks.
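As a rough illustration of the splicing step described in the abstract, the sketch below regenerates a masked region of a real image from a text prompt with DALL-E2's inpainting endpoint. This is a minimal sketch under stated assumptions, not the authors' released pipeline: the file names and example prompt are hypothetical, the current `openai` Python client is assumed, and the paper's human-verified prompt annotation platform is not reproduced.

```python
# Minimal sketch of text-prompt splicing via DALL-E 2 inpainting.
# Assumptions: the `openai` Python client with OPENAI_API_KEY set in the
# environment; "photo.png" is a square RGBA PNG and "mask.png" is transparent
# exactly where the model should inpaint. Hypothetical example, not the
# authors' released code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("photo.png", "rb"),  # authentic source image
    mask=open("mask.png", "rb"),    # transparent pixels mark the splice region
    prompt="a red sports car parked on the street",  # hypothetical prompt
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # URL of the manipulated (spliced) image
```

The abstract names the two evaluation tasks but not the exact protocol, so the following sketch assumes common choices: image-level AUC for forgery detection and pixel-level F1 for forgery localization, both computed with scikit-learn.

```python
# Hedged sketch of the two forensic evaluation tasks on AutoSplice-style
# outputs. The metric choices (AUC, pixel F1) and the toy data are
# assumptions for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Forgery detection: one score per image (label 1 = manipulated).
labels = np.array([0, 0, 1, 1])          # ground-truth image labels
scores = np.array([0.1, 0.4, 0.8, 0.9])  # detector confidence scores
print("detection AUC:", roc_auc_score(labels, scores))

# Forgery localization: per-pixel binary masks (1 = manipulated pixel).
gt_mask = np.zeros((256, 256), dtype=int)
gt_mask[64:128, 64:128] = 1       # toy ground-truth splice region
pred_mask = np.zeros((256, 256), dtype=int)
pred_mask[60:124, 70:130] = 1     # toy predicted region
print("localization F1:", f1_score(gt_mask.ravel(), pred_mask.ravel()))
```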
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score, and qualitative comparisons when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust, user-intention-aware cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
- ObjectFormer for Image Manipulation Detection and Localization [118.89882740099137]
We propose ObjectFormer to detect and localize image manipulations.
We extract high-frequency features of the images and combine them with RGB features as multimodal patch embeddings (see the sketch after this list).
We conduct extensive experiments on various datasets and the results verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-03-28T12:27:34Z)
- NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched.
We introduce several strategies for automatic retrieval of suitable images for the given captions.
Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
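The ObjectFormer entry above describes fusing high-frequency and RGB features into multimodal patch embeddings; the sketch below illustrates one way such an embedding could look. It is a hedged PyTorch sketch, not the paper's implementation: the Laplacian high-pass filter, embedding dimension, and patch size are illustrative assumptions.

```python
# Hedged sketch of ObjectFormer-style multimodal patch embeddings:
# RGB patches and high-frequency patches concatenated as transformer tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalPatchEmbed(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.freq_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Simple per-channel Laplacian high-pass filter (an assumption;
        # the paper's actual frequency extraction may differ).
        k = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        self.register_buffer("hp", k.view(1, 1, 3, 3).repeat(3, 1, 1, 1))

    def forward(self, x):  # x: (B, 3, H, W)
        hf = F.conv2d(x, self.hp, padding=1, groups=3)  # high-frequency residual
        rgb_tok = self.rgb_proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        hf_tok = self.freq_proj(hf).flatten(2).transpose(1, 2)  # (B, N, dim)
        return torch.cat([rgb_tok, hf_tok], dim=1)  # (B, 2N, dim) token sequence

tokens = MultimodalPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 392, 256])
```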