TVConv: Efficient Translation Variant Convolution for Layout-aware
Visual Processing
- URL: http://arxiv.org/abs/2203.10489v2
- Date: Tue, 22 Mar 2022 07:28:02 GMT
- Title: TVConv: Efficient Translation Variant Convolution for Layout-aware
Visual Processing
- Authors: Jierun Chen, Tianlang He, Weipeng Zhuo, Li Ma, Sangtae Ha, S.-H. Gary
Chan
- Abstract summary: We develop efficient translation variant convolution (TVConv) for layout-aware visual processing.
TVConv significantly improves the efficiency of the convolution and can be readily plugged into various network architectures.
- Score: 10.996162201540695
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As convolution has empowered many smart applications, dynamic convolution
further equips it with the ability to adapt to diverse inputs. However, the
static and dynamic convolutions are either layout-agnostic or
computation-heavy, making them ill-suited for layout-specific applications,
e.g., face recognition and medical image segmentation. We observe that these
applications naturally exhibit the characteristics of large intra-image
(spatial) variance and small cross-image variance. This observation motivates
our efficient translation variant convolution (TVConv) for layout-aware visual
processing. Technically, TVConv is composed of affinity maps and a
weight-generating block. While affinity maps depict pixel-paired relationships
gracefully, the weight-generating block can be explicitly overparameterized for
better training while maintaining efficient inference. Although conceptually
simple, TVConv significantly improves the efficiency of the convolution and can
be readily plugged into various network architectures. Extensive experiments on
face recognition show that TVConv reduces the computational cost by up to 3.1x
and improves the corresponding throughput by 2.3x while maintaining a high
accuracy compared to the depthwise convolution. Moreover, for the same
computation cost, we boost the mean accuracy by up to 4.21%. We also conduct
experiments on the optic disc/cup segmentation task and obtain better
generalization performance, which helps mitigate the critical data scarcity
issue. Code is available at https://github.com/JierunChen/TVConv.
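
As a rough illustration of the mechanism described in the abstract (learned, layout-specific affinity maps feeding a weight-generating block that emits per-location depthwise kernels), here is a minimal PyTorch sketch. The module name, the affinity-channel count, and the two-layer weight-generating block are assumptions made for illustration only; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch, assuming a 3x3 depthwise formulation; not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TVConvSketch(nn.Module):
    def __init__(self, channels, h, w, kernel_size=3, affinity_channels=4):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        # Affinity maps: learnable, layout-specific, shared across all input images.
        self.affinity = nn.Parameter(torch.randn(1, affinity_channels, h, w) * 0.1)
        # Weight-generating block: maps affinity maps to per-pixel depthwise kernels.
        self.weight_gen = nn.Sequential(
            nn.Conv2d(affinity_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels * kernel_size * kernel_size, 3, padding=1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Per-location depthwise kernels, identical for every sample in the batch.
        weights = self.weight_gen(self.affinity).view(1, c, self.k * self.k, h * w)
        # Unfold the input into k*k neighbourhoods and apply the location-specific kernels.
        patches = F.unfold(x, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h * w)
        return (patches * weights).sum(dim=2).view(b, c, h, w)

# Usage: y = TVConvSketch(channels=64, h=56, w=56)(torch.randn(2, 64, 56, 56))
```

The key design point this sketch tries to capture is that the kernels vary across spatial locations (large intra-image variance) but not across images (small cross-image variance), so the expensive weight generation depends only on the learned affinity maps rather than on each input.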
Related papers
- Optimizing Vision Transformers for Medical Image Segmentation and
Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling, and skip connections.
CS-Unet can be trained from scratch and inherits the strengths of convolutions at each feature-processing stage.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Convolutional Xformers for Vision [2.7188347260210466]
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks.
The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs).
We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations.
We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage.
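
For context, a minimal sketch of the kernelized linear-attention trick (Linear Transformer style) that such architectures build on: with a positive feature map phi, softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V) normalized by phi(Q)(phi(K)^T 1), which costs O(N*d^2) instead of O(N^2*d). The feature map phi(x) = elu(x) + 1 and the shapes are illustrative assumptions, not the exact CXV formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq_len, dim); returns (batch, seq_len, dim)."""
    q = F.elu(q) + 1                                  # positive feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', k, v)           # phi(K)^T V, shape (B, d, d)
    z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1))   # normaliser phi(Q) (phi(K)^T 1)
    return torch.einsum('bnd,bde->bne', q, kv) / (z.unsqueeze(-1) + eps)

# y = linear_attention(torch.randn(2, 196, 64), torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```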
arXiv Detail & Related papers (2022-01-25T12:32:09Z)
- Fast and High-Quality Image Denoising via Malleable Convolutions [72.18723834537494]
We present Malleable Convolution (MalleConv), an efficient variant of dynamic convolution.
Unlike previous works, MalleConv generates a much smaller set of spatially-varying kernels from the input.
We also build an efficient denoising network using MalleConv, coined MalleNet.
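
A minimal sketch of the general idea of predicting a small, coarse grid of spatially-varying kernels from the input itself (in contrast to TVConv's input-independent affinity maps); the predictor, grid size, and bilinear upsampling of the kernel field are assumptions for illustration, not MalleConv's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseDynamicConvSketch(nn.Module):
    def __init__(self, channels, grid=4, k=3):
        super().__init__()
        self.c, self.k = channels, k
        # Lightweight predictor: pool the input to a grid x grid map and emit
        # one depthwise k x k kernel per grid cell.
        self.predict = nn.Sequential(
            nn.AdaptiveAvgPool2d(grid),
            nn.Conv2d(channels, channels * k * k, 1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Upsample the coarse kernel field so each pixel gets a smoothly varying kernel.
        kernels = F.interpolate(self.predict(x), size=(h, w), mode='bilinear',
                                align_corners=False).view(b, c, self.k * self.k, h * w)
        patches = F.unfold(x, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h * w)
        return (patches * kernels).sum(2).view(b, c, h, w)
```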
arXiv Detail & Related papers (2022-01-02T18:35:20Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic sliceable network (DS-Net++), which input-dependently adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
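
A rough sketch of input-dependent weight slicing: a tiny gate chooses a width ratio for the current input, and only that leading slice of the convolution weight is applied. The gate design, candidate ratios, and per-batch hard decision are illustrative assumptions rather than the actual DS-Net++ scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, ratios=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.ratios = ratios
        # Gate: global context -> a choice among the candidate width ratios.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, len(ratios)))

    def forward(self, x):
        # Hard choice for the whole batch (simplification); DS-Net++ trains this
        # decision differentiably and handles the varying width in later layers.
        idx = self.gate(x).argmax(dim=1)[0].item()
        out_ch = max(1, int(self.weight.shape[0] * self.ratios[idx]))
        return F.conv2d(x, self.weight[:out_ch], padding=1)
```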
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose Content-aware Convolution (CAC), which automatically detects smooth windows and replaces the original large kernel there with a 1x1 convolutional kernel.
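
A minimal sketch of the selection logic: judge window smoothness by local variance and take a cheap 1x1 branch there. The variance threshold and the dense masked formulation are assumptions for illustration; CAC's actual detection and its compute-saving implementation differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, var_thresh=1e-2):
        super().__init__()
        self.full = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)   # original k x k kernel
        self.cheap = nn.Conv2d(in_ch, out_ch, 1)                  # 1x1 replacement
        self.k, self.var_thresh = k, var_thresh

    def forward(self, x):
        # Local variance over each k x k window, averaged across channels.
        mean = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2)
        var = F.avg_pool2d(x * x, self.k, stride=1, padding=self.k // 2) - mean ** 2
        smooth = (var.mean(dim=1, keepdim=True) < self.var_thresh).float()
        # Smooth windows take the 1x1 branch; detailed windows keep the k x k branch.
        # (The real method skips the large kernel on smooth windows entirely to save
        # compute; this dense masked form only illustrates the selection.)
        return smooth * self.cheap(x) + (1 - smooth) * self.full(x)
```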
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs).
We show that, despite achieving comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- Skip-Convolutions for Efficient Video Processing [21.823332885657784]
Skip-Convolutions leverage the large amount of redundancy in video streams to save computation.
We replace all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet.
We reduce their computational cost consistently by a factor of 3-4x for two different tasks, without any accuracy drop.
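
A minimal sketch of the residual idea behind skipping computation across frames: since a bias-free convolution is linear, conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1}), so only the largely sparse frame difference needs processing and the previous output can be reused elsewhere. The simple magnitude threshold below stands in for the paper's learned gating.

```python
import torch
import torch.nn as nn

class SkipConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, thresh=1e-2):
        super().__init__()
        # bias=False keeps the layer linear, so the residual identity holds exactly.
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.thresh = thresh
        self.prev_in, self.prev_out = None, None

    def forward(self, x):
        if self.prev_in is None:                 # first frame: full computation
            out = self.conv(x)
        else:
            diff = x - self.prev_in
            # Zero out positions that barely changed; a real implementation skips
            # their multiply-adds instead of merely masking them.
            mask = (diff.abs().amax(dim=1, keepdim=True) > self.thresh).float()
            out = self.prev_out + self.conv(diff * mask)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```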
arXiv Detail & Related papers (2021-04-23T09:10:39Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
- Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions.
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at the 46M multiply-adds level, a 6.3% relative improvement.
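
A minimal sketch of region-aware filter assignment: a small guide branch predicts which of m regions each pixel belongs to, and each region applies its own depthwise kernel. The hard argmax assignment and dense per-region computation are simplifications for illustration, not the actual DRConv design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareConvSketch(nn.Module):
    def __init__(self, channels, regions=8, k=3):
        super().__init__()
        self.k = k
        self.guide = nn.Conv2d(channels, regions, 3, padding=1)   # per-pixel region logits
        # One depthwise k x k kernel per region.
        self.filters = nn.Parameter(torch.randn(regions, channels, 1, k, k) * 0.02)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        region = self.guide(x).argmax(dim=1, keepdim=True)        # (B, 1, H, W) hard assignment
        # Run each region's depthwise conv densely, then pick per pixel; a real
        # implementation would compute each kernel only inside its own region.
        outs = torch.stack([F.conv2d(x, f, padding=self.k // 2, groups=c)
                            for f in self.filters], dim=1)        # (B, m, C, H, W)
        idx = region.unsqueeze(2).expand(b, 1, c, h, w)
        return outs.gather(1, idx).squeeze(1)
```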
arXiv Detail & Related papers (2020-03-27T05:49:57Z)