SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and
Transformers
- URL: http://arxiv.org/abs/2308.07110v1
- Date: Mon, 14 Aug 2023 12:49:39 GMT
- Title: SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and
Transformers
- Authors: Xijun Wang, Xiaojie Chu, Chunrui Han, Xiangyu Zhang
- Abstract summary: This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers.
On the face recognition task, FaceResNet with SCSC improves performance by 2.7% with 68% fewer FLOPs and 79% fewer parameters.
On the ImageNet classification task, Swin Transformer with SCSC achieves even better performance with 22% fewer FLOPs, and ResNet with SCSC improves by 5.3% at similar complexity.
- Score: 18.073368359464915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. CNNs and Transformers have been successful across a wide variety of tasks; Transformers in particular account for a growing share of state-of-the-art results in the computer vision community. Researchers have therefore begun to explore the mechanisms behind these architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weights have been considered keys to designing effective base models. However, some issues remain to be addressed: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Inspired by this analysis and to address these problems, we design a general module that incorporates these design keys to enhance both CNNs and Transformers. SCSC introduces an efficient spatial cross-scale encoder and spatial embed module to capture assorted features in one layer. On the face recognition task, FaceResNet with SCSC improves performance by 2.7% with 68% fewer FLOPs and 79% fewer parameters. On the ImageNet classification task, Swin Transformer with SCSC achieves even better performance with 22% fewer FLOPs, and ResNet with SCSC improves by 5.3% at similar complexity. Furthermore, a traditional network (e.g., ResNet) embedded with SCSC can match Swin Transformer's performance.
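The abstract describes the module only at a high level (a spatial cross-scale encoder plus a spatial embed module that capture assorted features in one layer) and gives no implementation details. The sketch below is therefore a rough, hypothetical illustration of the general cross-scale pattern that description suggests: parallel depthwise convolutions at several kernel sizes fused by a pointwise convolution. The class name CrossScaleConv and all hyperparameters are invented here and are not taken from the paper.

```python
# Hypothetical sketch only: the abstract gives no implementation details, so this is a
# guess at the generic "multiple receptive fields in one layer" pattern it describes.
import torch
import torch.nn as nn


class CrossScaleConv(nn.Module):
    """Parallel depthwise convolutions at several scales, fused by a 1x1 convolution."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise branch per kernel size keeps the cost close to a single
        # depthwise layer while covering several receptive-field sizes at once.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # channel mixing
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the multi-scale responses, mix channels, and keep a residual path
        # so the block can be dropped into an existing CNN or Transformer stage.
        y = sum(branch(x) for branch in self.branches)
        return x + self.act(self.norm(self.fuse(y)))


if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(CrossScaleConv(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```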
Related papers
- CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features.
This integration enables efficient processing of detailed local and broader contextual information.
Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- Revisiting Image Deblurring with an Efficient ConvNet [24.703240497171503]
We propose a lightweight CNN network that features a large effective receptive field (ERF) and demonstrates comparable or even better performance than Transformers.
Our key design is an efficient CNN block dubbed LaKD, equipped with a large kernel depth-wise convolution and spatial-channel mixing structure (a generic sketch of this pattern appears after the related-papers list).
We achieve +0.17dB / +0.43dB PSNR over the state-of-the-art Restormer on defocus / motion deblurring benchmark datasets with 32% fewer parameters and 39% fewer MACs.
arXiv Detail & Related papers (2023-02-04T20:42:46Z)
- ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling, and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature process phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method-Vision Transformer with Convolutions Architecture Search (VTCAS)
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
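As a companion to the deblurring entry above, the following is a generic sketch of the large-kernel depthwise convolution plus channel-mixing pattern that the LaKD description points to. It is a common ConvNeXt-style layout, not the authors' actual block; the name LargeKernelBlock and the default hyperparameters are made up for illustration.

```python
# Generic large-kernel depthwise + pointwise channel-mixing block; an assumption-based
# illustration of the pattern named in the deblurring entry, not the paper's LaKD code.
import torch
import torch.nn as nn


class LargeKernelBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 31, expansion: int = 2):
        super().__init__()
        # Depthwise convolution with a large kernel enlarges the effective
        # receptive field while keeping the parameter count per-channel.
        self.spatial = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels,
        )
        # Two pointwise convolutions mix information across channels.
        hidden = channels * expansion
        self.channel = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial(x)      # spatial mixing (large-kernel depthwise)
        return x + self.channel(x)   # channel mixing (pointwise MLP)
```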