Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces
- URL: http://arxiv.org/abs/2505.16253v1
- Date: Thu, 22 May 2025 05:43:40 GMT
- Title: Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces
- Authors: Preeti Mehta, Aman Sagar, Suchi Kumari
- Abstract summary: This study addresses the challenge of distinguishing computer-generated imagery (CGI) from authentic digital images. It proposes a Swin Transformer based model for accurate differentiation between natural and synthetic images. The model's performance was tested across all color spaces, with RGB yielding the highest accuracy for each dataset.
- Score: 1.024113475677323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to address the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images across three color spaces: RGB, YCbCr, and HSV. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer based model for accurate differentiation between natural and synthetic images. The proposed model leverages the Swin Transformer's hierarchical architecture to capture local and global features for distinguishing CGI from natural images. Its performance was assessed through intra- and inter-dataset testing across three datasets: CiFAKE, JSSSTU, and Columbia. The model was evaluated individually on each dataset (D1, D2, D3) and on the combined datasets (D1+D2+D3) to test its robustness and domain generalization. To address dataset imbalance, data augmentation techniques were applied. Additionally, t-SNE visualization was used to demonstrate the feature separability achieved by the Swin Transformer across the selected color spaces. The model was tested across all three color spaces, with RGB yielding the highest accuracy for each dataset. As a result, RGB was selected for domain generalization analysis and compared with the CNN-based models VGG-19 and ResNet-50. The comparative results demonstrate the proposed model's effectiveness in detecting CGI, highlighting its robustness and reliability in both intra-dataset and inter-dataset evaluations. The findings highlight the Swin Transformer model's potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model's strong performance indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.
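As a rough, assumption-level sketch of the pipeline the abstract describes (converting images into one of the three color spaces, fine-tuning a pretrained Swin Transformer as a binary natural-vs-CGI classifier, and projecting pooled features with t-SNE), the following Python fragment uses OpenCV, timm, and scikit-learn. The checkpoint name, input size, label convention, and hyperparameters are illustrative, not the authors' released configuration.

```python
import cv2
import numpy as np
import timm
import torch
import torch.nn as nn

def load_in_color_space(path: str, space: str = "RGB") -> np.ndarray:
    """Read an image and convert it from OpenCV's native BGR to the requested space."""
    bgr = cv2.imread(path)
    codes = {
        "RGB": cv2.COLOR_BGR2RGB,
        "YCbCr": cv2.COLOR_BGR2YCrCb,  # OpenCV exposes YCbCr as YCrCb (Y, Cr, Cb order)
        "HSV": cv2.COLOR_BGR2HSV,
    }
    return cv2.cvtColor(bgr, codes[space])

def to_batch(img: np.ndarray, size: int = 224) -> torch.Tensor:
    """HWC uint8 image -> (1, 3, size, size) float tensor in [0, 1]."""
    x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    return torch.nn.functional.interpolate(x, size=(size, size), mode="bilinear")

# Binary head (natural vs. CGI) on a pretrained hierarchical Swin backbone.
model = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step; real use would iterate over a DataLoader
# covering D1/D2/D3, with augmentation applied to address class imbalance.
x = to_batch(load_in_color_space("example.jpg", space="YCbCr"))
label = torch.tensor([1])  # assumed convention: 1 = CGI, 0 = natural
loss = criterion(model(x), label)
loss.backward()
optimizer.step()

# t-SNE on pooled features from a headless copy of the backbone:
#   feat_model = timm.create_model("swin_base_patch4_window7_224",
#                                  pretrained=True, num_classes=0)
#   feats = feat_model(batch).detach().numpy()                 # (N, D)
#   emb = sklearn.manifold.TSNE(n_components=2).fit_transform(feats)
```

One practical note: OpenCV stores hue in [0, 179] for uint8 HSV images, so any per-channel normalization should be adjusted per color space.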
Related papers
- Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis [0.0]
This study proposes a deep learning framework for multi-class skin disease classification. We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, EfficientNet-B5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large). Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants.
arXiv Detail & Related papers (2025-08-06T15:57:49Z)
- AugmentGest: Can Random Data Cropping Augmentation Boost Gesture Recognition Performance? [49.64902130083662]
This paper proposes a comprehensive data augmentation framework that integrates geometric transformations, random variations, rotation, zooming, and intensity-based transformations. The proposed augmentation strategy is evaluated on three models: multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Network. A minimal image-domain sketch of these augmentation families follows this entry.
arXiv Detail & Related papers (2025-06-08T16:43:05Z)
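The entry above applies these augmentation families to gesture inputs (skeletons and point clouds); as a loose image-domain analogue only, the same families can be expressed with torchvision. All parameter ranges here are illustrative assumptions, not values from the paper.

```python
import torchvision.transforms as T

# Image-domain analogue of the augmentation families listed above;
# every range below is an illustrative assumption.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),    # random cropping + zoom
    T.RandomRotation(degrees=15),                  # geometric rotation
    T.ColorJitter(brightness=0.3, contrast=0.3),   # intensity-based changes
    T.RandomHorizontalFlip(),                      # simple geometric variation
    T.ToTensor(),
])
# usage: tensor = augment(pil_image), applied independently per training sample
```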
- Semantic Scene Completion with Multi-Feature Data Balancing Network [5.3431413737671525]
We propose a dual-head model for RGB and depth data (F-TSDF) inputs. Our hybrid encoder-decoder architecture with identity transformation in a pre-activation residual module effectively manages diverse signals within F-TSDF. We evaluate RGB feature fusion strategies and use a combined loss function: cross-entropy for 2D RGB features and weighted cross-entropy for 3D SSC predictions. A sketch of such a combined loss follows this entry.
arXiv Detail & Related papers (2024-12-02T12:12:21Z)
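The combined loss the entry above describes (plain cross-entropy on the 2D RGB branch plus weighted cross-entropy on the 3D semantic scene completion branch) can be sketched as below. The weight vector, the branch-balancing factor `alpha`, and the ignore index are assumptions for illustration, not values from the paper.

```python
import torch.nn.functional as F

def combined_ssc_loss(logits_2d, target_2d, logits_3d, target_3d,
                      class_weights, alpha=1.0, ignore_index=255):
    """Cross-entropy on 2D RGB features plus weighted cross-entropy on 3D SSC voxels."""
    # 2D branch: logits_2d (N, C, H, W), target_2d (N, H, W)
    loss_2d = F.cross_entropy(logits_2d, target_2d)
    # 3D branch: logits_3d (N, C, D, H, W), target_3d (N, D, H, W);
    # per-class weights counter the heavy class imbalance typical of SSC voxels
    loss_3d = F.cross_entropy(logits_3d, target_3d,
                              weight=class_weights, ignore_index=ignore_index)
    return loss_2d + alpha * loss_3d
```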
- Swin Transformer for Robust Differentiation of Real and Synthetic Images: Intra- and Inter-Dataset Analysis [0.0]
This study proposes a Swin Transformer-based model for accurate differentiation between natural and synthetic images.
The model's performance was evaluated through intra-dataset and inter-dataset testing across three distinct datasets.
arXiv Detail & Related papers (2024-09-07T06:43:17Z)
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers [48.74331852418905]
Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task's complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. We introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers.
arXiv Detail & Related papers (2024-03-11T10:48:56Z)
- T-ADAF: Adaptive Data Augmentation Framework for Image Classification Network based on Tensor T-product Operator [0.0]
This paper proposes an Adaptive Data Augmentation Framework based on the tensor T-product operator.
It triples each input image for training and aggregates the results from all three derived images, with less than a 0.1% increase in the number of parameters.
Numerical experiments show that this data augmentation framework can improve the performance of the original neural network model by 2%.
arXiv Detail & Related papers (2023-06-07T08:30:44Z)
- DCN-T: Dual Context Network with Transformer for Hyperspectral Image Classification [109.09061514799413]
Hyperspectral image (HSI) classification is challenging due to spatial variability caused by complex imaging conditions.
We propose a tri-spectral image generation pipeline that transforms HSI into high-quality tri-spectral images.
Our proposed method outperforms state-of-the-art methods for HSI classification.
arXiv Detail & Related papers (2023-04-19T18:32:52Z)
- Unsupervised Domain Adaptation with Histogram-gated Image Translation for Delayered IC Image Analysis [2.720699926154399]
Histogram-gated Image Translation (HGIT) is an unsupervised domain adaptation framework which transforms images from a given source dataset to the domain of a target dataset.
Our method achieves the best performance compared to the reported domain adaptation techniques, and is also reasonably close to the fully supervised benchmark.
arXiv Detail & Related papers (2022-09-27T15:53:22Z)
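HGIT itself is a translation framework gated by histogram statistics, whose details the snippet above does not spell out. As a much simpler stand-in that only illustrates the idea of aligning source-domain images to target-domain statistics, scikit-image's histogram matching can be used; this is an assumption-level illustration, not the paper's method.

```python
import numpy as np
from skimage import exposure

def match_to_target_domain(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map a source-domain image's per-channel histogram onto a target-domain reference."""
    # channel_axis=-1 treats the last axis as color channels for (H, W, C) arrays
    return exposure.match_histograms(source, reference, channel_axis=-1)
```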
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming [3.4788711710826083]
We propose an alternative to common data augmentation methods, applying it to the problem of crop/weed segmentation in precision farming.
We create semi-artificial samples by replacing the most relevant object classes (i.e., crop and weeds) with their synthesized counterparts.
In addition to RGB data, we also take into account near-infrared (NIR) information, generating four-channel multi-spectral synthetic images; a minimal channel-stacking sketch follows this entry.
arXiv Detail & Related papers (2020-09-12T08:49:36Z)
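At the data-layout level, the four-channel synthesis described above reduces to stacking aligned RGB and NIR captures and compositing synthesized object pixels under a mask. The sketch below shows only that layout step; the function names and the assumption of pixel-aligned captures are mine, not the paper's.

```python
import numpy as np

def make_multispectral(rgb: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Stack aligned RGB (H, W, 3) and NIR (H, W) captures into an (H, W, 4) image."""
    return np.dstack([rgb, nir])

def paste_synthetic(real4: np.ndarray, synth4: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace masked object pixels (e.g. crop or weed regions) with synthesized counterparts."""
    out = real4.copy()
    m = mask.astype(bool)
    out[m] = synth4[m]
    return out
```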
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.