A Fusion Model for Artwork Identification Based on Convolutional Neural Networks and Transformers
- URL: http://arxiv.org/abs/2502.18083v3
- Date: Thu, 27 Feb 2025 02:18:08 GMT
- Title: A Fusion Model for Artwork Identification Based on Convolutional Neural Networks and Transformers
- Authors: Zhenyu Wang, Heng Song
- Abstract summary: This paper proposes a fusion model combining CNNs and Transformers for artwork identification. Experiments on Chinese and oil painting datasets show the fusion model outperforms individual CNN and Transformer models.
- Score: 6.57747694461617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The identification of artwork is crucial in areas like cultural heritage protection, art market analysis, and historical research. With the advancement of deep learning, Convolutional Neural Networks (CNNs) and Transformer models have become key tools for image classification. While CNNs excel in local feature extraction, they struggle with global context, and Transformers are strong in capturing global dependencies but weak in fine-grained local details. To address these challenges, this paper proposes a fusion model combining CNNs and Transformers for artwork identification. The model first extracts local features using CNNs, then captures global context with a Transformer, followed by a feature fusion mechanism to enhance classification accuracy. Experiments on Chinese and oil painting datasets show the fusion model outperforms individual CNN and Transformer models, improving classification accuracy by 9.7% and 7.1%, respectively, and increasing F1 scores by 0.06 and 0.05. The results demonstrate the model's effectiveness and potential for future improvements, such as multimodal integration and architecture optimization.
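To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a CNN extracts local features, a Transformer encoder models global context over the resulting feature tokens, and the two descriptors are fused before classification. The backbone choice, embedding width, and fusion by concatenation are illustrative assumptions; the paper's exact configuration is not given here.

```python
import torch
import torch.nn as nn
from torchvision import models

class FusionClassifier(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 256):
        super().__init__()
        # CNN branch (assumed ResNet-18 trunk) for local feature extraction.
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # (B, 512, H/32, W/32)
        self.proj = nn.Conv2d(512, embed_dim, kernel_size=1)
        # Transformer branch: treats the feature map as a token sequence
        # and models global context across all spatial positions.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion by concatenating pooled local and global descriptors (assumed).
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.proj(self.cnn(x))                 # local CNN features
        local = fmap.mean(dim=(2, 3))                 # pooled local descriptor
        tokens = fmap.flatten(2).transpose(1, 2)      # (B, H*W, embed_dim)
        global_ctx = self.transformer(tokens).mean(dim=1)
        return self.head(torch.cat([local, global_ctx], dim=-1))

logits = FusionClassifier(num_classes=10)(torch.randn(2, 3, 224, 224))  # smoke test
```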
Related papers
- Interaction-Guided Two-Branch Image Dehazing Network [1.26404863283601]
Image dehazing aims to restore clean images from hazy ones.
CNNs and Transformers have demonstrated exceptional performance in local and global feature extraction.
We propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively.
arXiv Detail & Related papers (2024-10-14T03:21:56Z)
- CNN-Transformer Rectified Collaborative Learning for Medical Image Segmentation [60.08541107831459]
This paper proposes a CNN-Transformer rectified collaborative learning framework to learn stronger CNN-based and Transformer-based models for medical image segmentation.
Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy which introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels.
We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space.
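One plausible reading of the RLCL strategy, sketched in PyTorch below: pixels where a peer model's soft labels disagree with the ground truth are replaced by the one-hot ground truth before distilling into the other student. The masking rule and temperature are my assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def rlcl_loss(student_logits, peer_logits, target, num_classes, tau=2.0):
    """student_logits, peer_logits: (B, C, H, W); target: (B, H, W) class ids."""
    peer_soft = F.softmax(peer_logits / tau, dim=1)
    # Ground truth selects the wrong regions in the peer's soft labels ...
    wrong = (peer_logits.argmax(dim=1) != target).unsqueeze(1).float()
    # ... which are rectified with the one-hot ground truth before distillation.
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    rectified = wrong * onehot + (1.0 - wrong) * peer_soft
    log_p = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p, rectified, reduction="batchmean") * tau * tau

student = torch.randn(2, 4, 8, 8, requires_grad=True)
peer = torch.randn(2, 4, 8, 8)
labels = torch.randint(0, 4, (2, 8, 8))
loss = rlcl_loss(student, peer, labels, num_classes=4)
```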
arXiv Detail & Related papers (2024-08-25T01:27:35Z)
- Boosting Hyperspectral Image Classification with Gate-Shift-Fuse Mechanisms in a Novel CNN-Transformer Approach [8.982950112225264]
This paper introduces an HSI classification model that includes two convolutional blocks, a Gate-Shift-Fuse (GSF) block and a transformer block.
The GSF block is designed to strengthen the extraction of local and global spatial-spectral features.
An effective attention mechanism module is also proposed to enhance the extraction of information from HSI cubes.
arXiv Detail & Related papers (2024-06-20T09:05:50Z)
- Transformers and Slot Encoding for Sample Efficient Physical World Modelling [1.5498250598583487]
We propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene.
We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples.
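Since the summary leans on the slot-attention paradigm, here is a brief sketch of its core loop (after Locatello et al., 2020): slots compete for input features via a softmax over the slot axis and are then updated from their attended inputs. Dimensions, the learned deterministic slot initialization, and the iteration count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots: int = 4, dim: int = 64, iters: int = 3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))  # learned init
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:  # inputs: (B, N, dim)
        b = inputs.size(0)
        slots = self.slots_mu.expand(b, -1, -1)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(slots)
            # Softmax over the slot axis: slots compete for each input feature.
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)       # weighted mean
            updates = attn @ v                                 # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view_as(slots)
        return slots

slots = SlotAttention()(torch.randn(2, 32, 64))  # smoke test
```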
arXiv Detail & Related papers (2024-05-30T15:48:04Z)
- Traffic Sign Recognition Using Local Vision Transformer [1.8416014644193066]
This paper proposes a novel model that blends the advantages of both convolutional and transformer-based networks for traffic sign recognition.
The proposed model includes convolutional blocks for capturing local correlations and transformer-based blocks for learning global dependencies.
The experimental evaluations demonstrate that the hybrid network with the locality module outperforms pure transformer-based models and some of the best convolutional networks in accuracy.
arXiv Detail & Related papers (2023-11-11T19:42:41Z)
- Hybrid Focal and Full-Range Attention Based Graph Transformers [0.0]
We present a purely attention-based architecture, namely the Focal and Full-Range Graph Transformer (FFGT).
FFGT combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information.
Our approach enhances the performance of existing Graph Transformers on various open datasets.
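A toy illustration of the focal-plus-full-range idea: full attention over all nodes combined with attention masked to each node's K-hop ego-net. Deriving the mask from adjacency powers and summing the two attention terms are my assumptions about the aggregation.

```python
import torch
import torch.nn.functional as F

def k_hop_mask(adj: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean (N, N) mask that is True inside each node's k-hop ego-net."""
    reach = torch.eye(adj.size(0), dtype=torch.bool)
    hop = adj > 0
    for _ in range(k):
        reach = reach | ((reach.float() @ hop.float()) > 0)
    return reach

def focal_full_attention(x: torch.Tensor, adj: torch.Tensor, k: int = 2):
    """x: (N, d) node features; adj: (N, N) adjacency matrix."""
    scores = (x @ x.T) / x.size(-1) ** 0.5
    full = F.softmax(scores, dim=-1) @ x                      # full-range attention
    focal_scores = scores.masked_fill(~k_hop_mask(adj, k), float("-inf"))
    focal = F.softmax(focal_scores, dim=-1) @ x               # K-hop focal attention
    return full + focal                                       # aggregate both views

x = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
out = focal_full_attention(x, adj)  # smoke test
```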
arXiv Detail & Related papers (2023-11-08T12:53:07Z)
- Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network [63.845552349914186]
Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification.
Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations.
In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation.
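A minimal sketch of the multitask layout the summary describes: a shared encoder with separate classification and segmentation heads trained jointly. Layer sizes are illustrative, and the paper's actual hybrid CNN-Transformer encoder is dropped here for brevity.

```python
import torch
import torch.nn as nn

class MultitaskBUSNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(                     # shared feature extractor
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.cls_head = nn.Sequential(                    # tumor classification head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        self.seg_head = nn.Sequential(                    # segmentation head
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2))

    def forward(self, x):
        feats = self.encoder(x)
        return self.cls_head(feats), self.seg_head(feats)

cls_logits, seg_logits = MultitaskBUSNet()(torch.randn(2, 1, 128, 128))  # smoke test
```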
arXiv Detail & Related papers (2023-08-04T01:19:32Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone that extracts a feature map from an input image and a Transformer head that models global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information of CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
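A stripped-down sketch of the sampling-and-recovery pipeline named above: a learned strided convolution plays the role of adaptive block sampling, followed by a learned recovery and residual refinement stage. The real CSformer couples CNN and Transformer features during recovery; this simplification is mine.

```python
import torch
import torch.nn as nn

class TinyCS(nn.Module):
    def __init__(self, block: int = 8, ratio: float = 0.25):
        super().__init__()
        m = max(1, int(ratio * block * block))            # measurements per block
        # Adaptive sampling: each block of pixels is projected to m measurements.
        self.sample = nn.Conv2d(1, m, kernel_size=block, stride=block, bias=False)
        # Initial recovery: project measurements back to pixel blocks.
        self.recover = nn.ConvTranspose2d(m, 1, kernel_size=block, stride=block, bias=False)
        self.refine = nn.Sequential(                      # learned refinement stage
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):
        y = self.sample(x)              # compressive measurements
        x0 = self.recover(y)            # coarse reconstruction
        return x0 + self.refine(x0)     # residual refinement

out = TinyCS()(torch.randn(2, 1, 64, 64))  # smoke test
```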
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP) based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
- LocalViT: Analyzing Locality in Vision Transformers [101.53997555864822]
This paper studies the influence of locality mechanisms in vision transformers.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines.
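The locality mechanism can be sketched as a feed-forward network whose hidden activations are reshaped to a 2D grid and mixed by a depth-wise convolution; the expansion ratio and activation choice below are assumptions.

```python
import torch
import torch.nn as nn

class LocalFeedForward(nn.Module):
    def __init__(self, dim: int = 192, hidden: int = 768):
        super().__init__()
        self.expand = nn.Linear(dim, hidden)
        # Depth-wise convolution injects locality into the FFN.
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.reduce = nn.Linear(hidden, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        x = self.act(self.expand(tokens))              # (B, h*w, hidden)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)      # tokens -> 2D grid
        x = self.act(self.dwconv(x))                   # local spatial mixing
        x = x.flatten(2).transpose(1, 2)               # grid -> tokens
        return self.reduce(x)

out = LocalFeedForward()(torch.randn(2, 14 * 14, 192), 14, 14)  # smoke test
```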
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
- Fusion of CNNs and statistical indicators to improve image classification [65.51757376525798]
Convolutional Networks have dominated the field of computer vision for the last ten years.
The main strategy for prolonging this trend relies on further scaling up networks in size.
We hypothesise that adding heterogeneous sources of information may be more cost-effective for a CNN than building a bigger network.
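One way to read this hypothesis in code: concatenate a small CNN's pooled features with per-image statistical indicators before the classifier. Using channel means and standard deviations as the indicators is an illustrative assumption.

```python
import torch
import torch.nn as nn

class StatsFusionNet(nn.Module):
    def __init__(self, num_classes: int = 10, n_stats: int = 6):
        super().__init__()
        self.cnn = nn.Sequential(                             # small CNN branch
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())            # -> (B, 32)
        self.head = nn.Linear(32 + n_stats, num_classes)

    @staticmethod
    def indicators(x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=(2, 3))                             # per-channel mean (B, 3)
        std = x.std(dim=(2, 3))                               # per-channel std  (B, 3)
        return torch.cat([mean, std], dim=1)                  # (B, 6)

    def forward(self, x):
        # Fuse learned features with cheap statistical indicators.
        return self.head(torch.cat([self.cnn(x), self.indicators(x)], dim=1))

logits = StatsFusionNet()(torch.randn(2, 3, 64, 64))  # smoke test
```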
arXiv Detail & Related papers (2020-12-20T23:24:31Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
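For orientation, a compressed sketch of a Conformer block: a macaron pair of half-step feed-forward modules sandwiching self-attention and a depth-wise convolution module. The paper's relative positional attention, gating, and batch normalization are omitted here for brevity; sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, dim: int = 144, heads: int = 4, kernel: int = 31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.att_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depth-wise convolution captures local acoustic patterns.
        self.dwconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, dim)
        x = x + 0.5 * self.ff1(x)                          # half-step FFN
        a = self.att_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global self-attention
        c = self.conv_norm(x).transpose(1, 2)              # (B, dim, T)
        x = x + self.dwconv(c).transpose(1, 2)             # local conv module
        x = x + 0.5 * self.ff2(x)                          # half-step FFN
        return self.out_norm(x)

out = ConformerBlock()(torch.randn(2, 50, 144))  # smoke test
```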