Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models
- URL: http://arxiv.org/abs/2503.10718v1
- Date: Thu, 13 Mar 2025 07:21:16 GMT
- Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models
- Authors: Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng,
- Abstract summary: This paper tackles detecting AI-generated images and identifying their source models using CNN and CLIP-ViT classifiers. The CNN-based classifier uses EfficientNet-B0 as the backbone and is fed RGB channels, frequency features, and reconstruction errors. For CLIP-ViT, a pretrained CLIP image encoder extracts image features and an SVM performs the classification.
- Score: 8.149084146016587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles detecting AI-generated images and identifying their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed it RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and an SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found at https://github.com/uuugaga/Defactify_4
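As a rough illustration of the CNN branch described above, the sketch below stacks the RGB channels with a log-magnitude FFT map and feeds the result to an EfficientNet-B0 classifier. The 4-channel layout, the normalization, and the omission of the reconstruction-error feature are simplifying assumptions for illustration, not the authors' released pipeline; a companion sketch of the CLIP-ViT + SVM branch appears after the related-papers list below.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from PIL import Image

def frequency_feature(gray: np.ndarray) -> np.ndarray:
    """Log-magnitude FFT spectrum, shifted so low frequencies sit in the center."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    log_mag = np.log1p(np.abs(spectrum))
    return (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min() + 1e-8)

def build_input(path: str, size: int = 224) -> torch.Tensor:
    """Stack RGB with one frequency channel -> tensor of shape (1, 4, H, W)."""
    img = Image.open(path).convert("RGB").resize((size, size))
    rgb = np.asarray(img, dtype=np.float32) / 255.0            # H x W x 3
    freq = frequency_feature(rgb.mean(axis=-1))[..., None]     # H x W x 1
    stacked = np.concatenate([rgb, freq], axis=-1).astype(np.float32)
    return torch.from_numpy(stacked).permute(2, 0, 1).unsqueeze(0)

def make_model(num_classes: int = 2) -> nn.Module:
    """EfficientNet-B0 with the first conv widened to accept 4 input channels."""
    net = models.efficientnet_b0(weights=None)
    net.features[0][0] = nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1, bias=False)
    net.classifier[1] = nn.Linear(net.classifier[1].in_features, num_classes)
    return net

if __name__ == "__main__":
    model = make_model(num_classes=2)           # 2 classes: real vs. AI-generated
    logits = model(build_input("example.jpg"))  # shape (1, 2); "example.jpg" is a placeholder
```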
Related papers
- DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images [14.448350657613368]
DeeCLIP is a novel framework for detecting AI-generated images.
It incorporates DeeFuser, a fusion module that combines high-level and low-level features.
Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.90%.
arXiv Detail & Related papers (2025-04-28T15:06:28Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - A Sanity Check for AI-generated Image Detection [49.08585395873425]
We propose AIDE (AI-generated Image DEtector with Hybrid Features) to detect AI-generated images.
AIDE achieves +3.5% and +4.6% improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-06-27T17:59:49Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Transformer-based Clipped Contrastive Quantization Learning for Unsupervised Image Retrieval [15.982022297570108]
Unsupervised image retrieval aims to learn the important visual characteristics without any given labels in order to retrieve similar images for a given query image.
In this paper, we propose the TransClippedCLR model, which encodes the global context of an image using a Transformer and captures local context through patch-based processing.
Results using the proposed clipped contrastive learning are greatly improved on all datasets compared to the same backbone network with vanilla contrastive learning.
arXiv Detail & Related papers (2024-01-27T09:39:11Z) - Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z) - Creating Realistic Anterior Segment Optical Coherence Tomography Images using Generative Adversarial Networks [0.0]
A Generative Adversarial Network (GAN) is proposed to create high-resolution, realistic Anterior Segment Optical Coherence Tomography (AS-OCT) images.
We trained the Style and WAvelet based GAN on 142,628 AS-OCT B-scans.
arXiv Detail & Related papers (2023-06-24T20:48:00Z) - CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images [7.868449549351487]
This article proposes to enhance our ability to recognise AI-generated images through computer vision.
The two sets of data pose a binary classification problem: whether the photograph is real or generated by AI.
This study proposes the use of a Convolutional Neural Network (CNN) to classify the images into two categories: real or fake.
arXiv Detail & Related papers (2023-03-24T16:33:06Z) - Face Recognition in the age of CLIP & Billion image datasets [0.0]
We evaluate the performance of various CLIP models as zero-shot face recognizers.
We also investigate the robustness of CLIP models against data poisoning attacks.
arXiv Detail & Related papers (2023-01-18T05:34:57Z) - Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by only training new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z) - Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z) - Image Quality Assessment using Contrastive Learning [50.265638572116984]
We train a deep Convolutional Neural Network (CNN) using a contrastive pairwise objective to solve the auxiliary problem.
We show through extensive experiments that CONTRIQUE achieves competitive performance when compared to state-of-the-art NR image quality models.
Our results suggest that powerful quality representations with perceptual relevance can be obtained without requiring large labeled subjective image quality datasets.
arXiv Detail & Related papers (2021-10-25T21:01:00Z)
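Several entries above, like the main paper's CLIP-ViT branch and the CLIP-feature detector in "Raising the Bar of AI-generated Image Detection with CLIP", share the same recipe: freeze a pretrained CLIP image encoder, extract embeddings, and fit a lightweight classifier on top. The sketch below illustrates that recipe under assumed choices (open_clip's ViT-B-32 weights, an RBF-kernel SVM, placeholder file paths); it is not any of the papers' released code.

```python
import torch
import open_clip
from PIL import Image
from sklearn.svm import SVC

device = "cuda" if torch.cuda.is_available() else "cpu"
# Pretrained CLIP image encoder (kept frozen); the model name is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

@torch.no_grad()
def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch.to(device))
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# Hypothetical training data: label 0 = real, 1 = AI-generated
# (or one label per generator for the source-identification task).
train_paths, train_labels = ["real_0.jpg", "fake_0.jpg"], [0, 1]
clf = SVC(kernel="rbf")                    # lightweight classifier on frozen features
clf.fit(embed(train_paths), train_labels)
print(clf.predict(embed(["query.jpg"])))   # "query.jpg" is a placeholder path
```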