Related papers: FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

URL: http://arxiv.org/abs/2409.03109v1
Date: Thu, 22 Aug 2024 15:41:56 GMT
Title: FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Abstract summary: FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. It exploits the complementarity between vision and language, along with a soft prompt-tuning strategy, to detect fake images. FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47%.
Score: 14.448350657613368
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language, along with a soft prompt-tuning strategy, to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47%, while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at https://github.com/Mamadou-Keita/FIDAVL.
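For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how a binary F1-score and a sentence-level ROUGE-L score are commonly computed. It is not taken from the FIDAVL code: the function names, the whitespace tokenization, the choice of the balanced F-measure for ROUGE-L, and the example generator names are all assumptions made for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L as a balanced F-measure over LCS (assumed variant)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical detection labels: 1 = fake, 0 = real.
detection_f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 1])

# Hypothetical attribution answer compared against a reference sentence.
attribution_rouge = rouge_l(
    "this image was generated by stable diffusion",
    "this image was generated by glide",
)
```

ROUGE-L is a natural fit for attribution when the model answers in free text rather than picking a class index, since it rewards overlap with the reference sentence even when the wording differs slightly.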

Related papers

Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models [29.571937393873444]
Cross-modal contrastive learning (CLIP) methods suffer from suboptimal visual representation capabilities. We propose ALTA (ALign Through Adapting), an efficient vision-language alignment method that utilizes only about 8% of the trainable parameters. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling.
arXiv Detail & Related papers (2025-06-10T17:02:27Z)
FLIP Reasoning Challenge [20.706469085872516]
This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks. FLIP challenges present users with two orderings of 4 images, requiring them to identify the coherent one. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs).
arXiv Detail & Related papers (2025-04-16T17:07:16Z)
Appeal prediction for AI up-scaled Images [45.61706071739717]
We describe our developed dataset, which uses 136 base images and five different up-scaling methods. We evaluate the appeal of the different methods, and the results indicate that Real-ESRGAN and BSRGAN are the best. In addition, we evaluate state-of-the-art image appeal and quality models; none of the models showed high prediction performance.
arXiv Detail & Related papers (2025-02-19T13:45:24Z)
Visual Perception in Text Strings [24.60102607739684]
In this work, we select ASCII art as a representative artifact, where the lines and brightness used to depict each concept are rendered by characters. We benchmark model performance on this task by constructing an evaluation dataset and also collect a training set to elicit the models' visual perception ability. Results reveal that although humans can achieve nearly 100% accuracy, the state-of-the-art LLMs and MLLMs lag far behind.
arXiv Detail & Related papers (2024-10-02T16:46:01Z)
Accelerating Domain-Aware Electron Microscopy Analysis Using Deep Learning Models with Synthetic Data and Image-Wide Confidence Scoring [0.0]
We create a physics-based synthetic image and data generator, resulting in a machine learning model that achieves comparable precision (0.86), recall (0.63), F1 scores (0.71), and engineering property predictions (R2=0.82). Our study demonstrates that synthetic data can eliminate human reliance in ML and provides a means for domain awareness in cases where many feature detections per image are needed.
arXiv Detail & Related papers (2024-08-02T20:15:15Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection [14.448350657613364]
Deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-language models (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images.
arXiv Detail & Related papers (2024-04-02T13:54:22Z)
Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z)
Localized Symbolic Knowledge Distillation for Visual Commonsense Models [150.18129140140238]
We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model. We find that training on the localized commonsense corpus can successfully distill existing vision-language models to support a reference-as-input interface.
arXiv Detail & Related papers (2023-12-08T05:23:50Z)
Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models. Our empirical study shows DeFo's significance in improving the vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.