Sensitive Image Classification by Vision Transformers
- URL: http://arxiv.org/abs/2412.16446v1
- Date: Sat, 21 Dec 2024 02:34:24 GMT
- Title: Sensitive Image Classification by Vision Transformers
- Authors: Hanxian He, Campbell Wilson, Thanh Thi Nguyen, Janis Dalins
- Abstract summary: Vision transformer models leverage a self-attention mechanism to capture global interactions among contextual local elements.
In our study, we conducted a comparative analysis between various popular vision transformer models and traditional pre-trained ResNet models.
The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities.
- Score: 1.9598097298813262
- Abstract: When it comes to classifying child sexual abuse images, managing similar inter-class correlations and diverse intra-class correlations poses a significant challenge. Vision transformer models, unlike conventional deep convolutional network models, leverage a self-attention mechanism to capture global interactions among contextual local elements. This allows them to navigate through image patches effectively, avoiding incorrect correlations and reducing ambiguity in attention maps, thus proving their efficacy in computer vision tasks. Rather than directly analyzing child sexual abuse data, we constructed two datasets: one comprising clean and pornographic images, and another with three classes that additionally includes images indicative of pornography, sourced from Reddit and Google Open Images data. In our experiments, we also employ an adult content image benchmark dataset. These datasets served as a basis for assessing the performance of vision transformer models in pornographic image classification. In our study, we conducted a comparative analysis between various popular vision transformer models and traditional pre-trained ResNet models. Furthermore, we compared them with established methods for sensitive image detection, such as an attention- and metric-learning-based CNN and Bumble. The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities in this task.
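As a rough illustration of the kind of comparison the abstract describes, the sketch below fine-tunes one pre-trained vision transformer and one pre-trained ResNet on a two-class (clean vs. pornographic) image folder. It assumes PyTorch with the timm and torchvision libraries; the model names, dataset path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: comparing a pre-trained ViT against a pre-trained ResNet
# on a binary sensitive-image classification task. All paths, model choices,
# and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import timm
from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader

NUM_CLASSES = 2  # clean vs. pornographic; the three-class setting would use 3

# Standard ImageNet preprocessing; both architectures expect 224x224 inputs.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical ImageFolder layout: data/train/<class_name>/*.jpg
train_set = datasets.ImageFolder("data/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

def build_models():
    """Return one vision transformer and one ResNet baseline, both pre-trained."""
    vit = timm.create_model("vit_base_patch16_224", pretrained=True,
                            num_classes=NUM_CLASSES)
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)
    return {"vit_base_patch16_224": vit, "resnet50": resnet}

def finetune(model, loader, epochs=3, lr=1e-4, device=None):
    """Plain cross-entropy fine-tuning loop shared by both architectures."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

if __name__ == "__main__":
    for name, model in build_models().items():
        print(f"Fine-tuning {name} ...")
        finetune(model, train_loader)
```

The three-class dataset and the adult-content benchmark mentioned in the abstract would plug into the same loop by changing NUM_CLASSES and pointing the loaders at the corresponding folders.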
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by T2I (text-to-image) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- A Comprehensive Study of Vision Transformers in Image Classification Tasks [0.46040036610482665]
We conduct a comprehensive survey of existing papers on Vision Transformers for image classification.
We first introduce the popular image classification datasets that influenced the design of models.
We present Vision Transformer models in chronological order, starting with early attempts at adapting the attention mechanism to vision tasks.
arXiv Detail & Related papers (2023-12-02T21:38:16Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Exploring Vision Transformers for Fine-grained Classification [0.0]
We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
arXiv Detail & Related papers (2021-06-19T23:57:31Z)
- A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer [27.023667473278266]
We first briefly review the development of Convolutional Neural Network and Visual Transformer in deep learning.
We then use various models of Convolutional Neural Network and Visual Transformer to conduct a series of experiments on the image dataset of tiny objects.
We discuss the problems in the classification of tiny objects and make a prospect for the classification of tiny objects in the future.
arXiv Detail & Related papers (2021-06-03T15:28:17Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Counterfactual Generative Networks [59.080843365828756]
We propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision.
By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background.
We show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task.
arXiv Detail & Related papers (2021-01-15T10:23:12Z)
- Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations [71.00754846434744]
We show that imperceptible additive perturbations can significantly alter the disparity map.
We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust.
arXiv Detail & Related papers (2020-09-21T19:20:09Z)
- Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss [28.17607283348278]
Existing approaches for unsupervised metric learning focus on exploring self-supervision information within the input image itself.
We observe that, when analyzing images, human eyes often compare images against each other instead of examining images individually.
We develop a new approach to unsupervised deep metric learning where the network is learned based on self-supervision information across images.
arXiv Detail & Related papers (2020-08-10T19:33:47Z)