TVConv: Efficient Translation Variant Convolution for Layout-aware
Visual Processing
- URL: http://arxiv.org/abs/2203.10489v2
- Date: Tue, 22 Mar 2022 07:28:02 GMT
- Title: TVConv: Efficient Translation Variant Convolution for Layout-aware
Visual Processing
- Authors: Jierun Chen, Tianlang He, Weipeng Zhuo, Li Ma, Sangtae Ha, S.-H. Gary
Chan
- Abstract summary: We develop efficient translation variant convolution (TVConv) for layout-aware visual processing.
TVConv significantly improves the efficiency of the convolution and can be readily plugged into various network architectures.
- Score: 10.996162201540695
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As convolution has empowered many smart applications, dynamic convolution
further equips it with the ability to adapt to diverse inputs. However, the
static and dynamic convolutions are either layout-agnostic or
computation-heavy, making them ill-suited for layout-specific applications,
e.g., face recognition and medical image segmentation. We observe that these
applications naturally exhibit the characteristics of large intra-image
(spatial) variance and small cross-image variance. This observation motivates
our efficient translation variant convolution (TVConv) for layout-aware visual
processing. Technically, TVConv is composed of affinity maps and a
weight-generating block. While affinity maps depict pixel-paired relationships
gracefully, the weight-generating block can be explicitly overparameterized for
better training while maintaining efficient inference. Although conceptually
simple, TVConv significantly improves the efficiency of the convolution and can
be readily plugged into various network architectures. Extensive experiments on
face recognition show that TVConv reduces the computational cost by up to 3.1x
and improves the corresponding throughput by 2.3x while maintaining a high
accuracy compared to the depthwise convolution. Moreover, for the same
computation cost, we boost the mean accuracy by up to 4.21%. We also conduct
experiments on the optic disc/cup segmentation task and obtain better
generalization performance, which helps mitigate the critical data scarcity
issue. Code is available at https://github.com/JierunChen/TVConv.
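
As a rough illustration of the mechanism described in the abstract (learned, layout-specific affinity maps feeding a weight-generating block that emits per-location depthwise kernels), here is a minimal PyTorch sketch. The module name, the affinity-channel count, and the two-layer weight-generating block are assumptions made for illustration only; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch, assuming a 3x3 depthwise formulation; not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TVConvSketch(nn.Module):
    def __init__(self, channels, h, w, kernel_size=3, affinity_channels=4):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        # Affinity maps: learnable, layout-specific, shared across all input images.
        self.affinity = nn.Parameter(torch.randn(1, affinity_channels, h, w) * 0.1)
        # Weight-generating block: maps affinity maps to per-pixel depthwise kernels.
        self.weight_gen = nn.Sequential(
            nn.Conv2d(affinity_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels * kernel_size * kernel_size, 3, padding=1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Per-location depthwise kernels, identical for every sample in the batch.
        weights = self.weight_gen(self.affinity).view(1, c, self.k * self.k, h * w)
        # Unfold the input into k*k neighbourhoods and apply the location-specific kernels.
        patches = F.unfold(x, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h * w)
        return (patches * weights).sum(dim=2).view(b, c, h, w)

# Usage: y = TVConvSketch(channels=64, h=56, w=56)(torch.randn(2, 64, 56, 56))
```

The key design point this sketch tries to capture is that the kernels vary across spatial locations (large intra-image variance) but not across images (small cross-image variance), so the expensive weight generation depends only on the learned affinity maps rather than on each input.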
Related papers
- Optimizing Vision Transformers for Medical Image Segmentation and
Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling, and skip connections.
CS-Unet can be trained from scratch and inherits the strengths of convolutions at each feature-processing stage.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Convolutional Xformers for Vision [2.7188347260210466]
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks.
The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs).
We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations.
We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage.
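
For context, a minimal sketch of the kernelized linear-attention trick (Linear Transformer style) that such architectures build on: with a positive feature map phi, softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V) normalized by phi(Q)(phi(K)^T 1), which costs O(N*d^2) instead of O(N^2*d). The feature map phi(x) = elu(x) + 1 and the shapes are illustrative assumptions, not the exact CXV formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq_len, dim); returns (batch, seq_len, dim)."""
    q = F.elu(q) + 1                                  # positive feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', k, v)           # phi(K)^T V, shape (B, d, d)
    z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1))   # normaliser phi(Q) (phi(K)^T 1)
    return torch.einsum('bnd,bde->bne', q, kv) / (z.unsqueeze(-1) + eps)

# y = linear_attention(torch.randn(2, 196, 64), torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```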
arXiv Detail & Related papers (2022-01-25T12:32:09Z)
- Fast and High-Quality Image Denoising via Malleable Convolutions [72.18723834537494]
We present Malleable Convolution (MalleConv), an efficient variant of dynamic convolution.
Unlike previous works, MalleConv generates a much smaller set of spatially-varying kernels from the input.
We also build an efficient denoising network using MalleConv, coined MalleNet.
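
A minimal sketch of the general idea of predicting a small, coarse grid of spatially-varying kernels from the input itself (in contrast to TVConv's input-independent affinity maps); the predictor, grid size, and bilinear upsampling of the kernel field are assumptions for illustration, not MalleConv's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseDynamicConvSketch(nn.Module):
    def __init__(self, channels, grid=4, k=3):
        super().__init__()
        self.c, self.k = channels, k
        # Lightweight predictor: pool the input to a grid x grid map and emit
        # one depthwise k x k kernel per grid cell.
        self.predict = nn.Sequential(
            nn.AdaptiveAvgPool2d(grid),
            nn.Conv2d(channels, channels * k * k, 1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Upsample the coarse kernel field so each pixel gets a smoothly varying kernel.
        kernels = F.interpolate(self.predict(x), size=(h, w), mode='bilinear',
                                align_corners=False).view(b, c, self.k * self.k, h * w)
        patches = F.unfold(x, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h * w)
        return (patches * kernels).sum(2).view(b, c, h, w)
```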
arXiv Detail & Related papers (2022-01-02T18:35:20Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic sliceable network (DS-Net++), which input-dependently adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
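
A rough sketch of input-dependent weight slicing: a tiny gate chooses a width ratio for the current input, and only that leading slice of the convolution weight is applied. The gate design, candidate ratios, and per-batch hard decision are illustrative assumptions rather than the actual DS-Net++ scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceableConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, ratios=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.ratios = ratios
        # Gate: global context -> a choice among the candidate width ratios.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, len(ratios)))

    def forward(self, x):
        # Hard choice for the whole batch (simplification); DS-Net++ trains this
        # decision differentiably and handles the varying width in later layers.
        idx = self.gate(x).argmax(dim=1)[0].item()
        out_ch = max(1, int(self.weight.shape[0] * self.ratios[idx]))
        return F.conv2d(x, self.weight[:out_ch], padding=1)
```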
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose Content-aware Convolution (CAC), which automatically detects smooth windows and replaces the original large kernel there with a 1x1 convolutional kernel.
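
A minimal sketch of the selection logic: judge window smoothness by local variance and take a cheap 1x1 branch there. The variance threshold and the dense masked formulation are assumptions for illustration; CAC's actual detection and its compute-saving implementation differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, var_thresh=1e-2):
        super().__init__()
        self.full = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)   # original k x k kernel
        self.cheap = nn.Conv2d(in_ch, out_ch, 1)                  # 1x1 replacement
        self.k, self.var_thresh = k, var_thresh

    def forward(self, x):
        # Local variance over each k x k window, averaged across channels.
        mean = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2)
        var = F.avg_pool2d(x * x, self.k, stride=1, padding=self.k // 2) - mean ** 2
        smooth = (var.mean(dim=1, keepdim=True) < self.var_thresh).float()
        # Smooth windows take the 1x1 branch; detailed windows keep the k x k branch.
        # (The real method skips the large kernel on smooth windows entirely to save
        # compute; this dense masked form only illustrates the selection.)
        return smooth * self.cheap(x) + (1 - smooth) * self.full(x)
```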
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs).
We show that, despite achieving comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- Skip-Convolutions for Efficient Video Processing [21.823332885657784]
Skip-Convolutions leverage the large amount of redundancy in video streams to save computation.
We replace all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet.
We reduce their computational cost consistently by a factor of 3-4x for two different tasks, without any accuracy drop.
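
A minimal sketch of the residual idea behind skipping computation across frames: since a bias-free convolution is linear, conv(x_t) = conv(x_{t-1}) + conv(x_t - x_{t-1}), so only the largely sparse frame difference needs processing and the previous output can be reused elsewhere. The simple magnitude threshold below stands in for the paper's learned gating.

```python
import torch
import torch.nn as nn

class SkipConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, thresh=1e-2):
        super().__init__()
        # bias=False keeps the layer linear, so the residual identity holds exactly.
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.thresh = thresh
        self.prev_in, self.prev_out = None, None

    def forward(self, x):
        if self.prev_in is None:                 # first frame: full computation
            out = self.conv(x)
        else:
            diff = x - self.prev_in
            # Zero out positions that barely changed; a real implementation skips
            # their multiply-adds instead of merely masking them.
            mask = (diff.abs().amax(dim=1, keepdim=True) > self.thresh).float()
            out = self.prev_out + self.conv(diff * mask)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```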
arXiv Detail & Related papers (2021-04-23T09:10:39Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
- Dynamic Region-Aware Convolution [85.20099799084026]
We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to corresponding spatial regions.
On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at the 46M multiply-adds level, a 6.3% relative improvement.
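
A minimal sketch of region-aware filter assignment: a small guide branch predicts which of m regions each pixel belongs to, and each region applies its own depthwise kernel. The hard argmax assignment and dense per-region computation are simplifications for illustration, not the actual DRConv design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareConvSketch(nn.Module):
    def __init__(self, channels, regions=8, k=3):
        super().__init__()
        self.k = k
        self.guide = nn.Conv2d(channels, regions, 3, padding=1)   # per-pixel region logits
        # One depthwise k x k kernel per region.
        self.filters = nn.Parameter(torch.randn(regions, channels, 1, k, k) * 0.02)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        region = self.guide(x).argmax(dim=1, keepdim=True)        # (B, 1, H, W) hard assignment
        # Run each region's depthwise conv densely, then pick per pixel; a real
        # implementation would compute each kernel only inside its own region.
        outs = torch.stack([F.conv2d(x, f, padding=self.k // 2, groups=c)
                            for f in self.filters], dim=1)        # (B, m, C, H, W)
        idx = region.unsqueeze(2).expand(b, 1, c, h, w)
        return outs.gather(1, idx).squeeze(1)
```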
arXiv Detail & Related papers (2020-03-27T05:49:57Z)