Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks
- URL: http://arxiv.org/abs/2303.03667v3
- Date: Sun, 21 May 2023 15:04:11 GMT
- Title: Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks
- Authors: Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho
Lee, S.-H. Gary Chan
- Abstract summary: We propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously.
Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices.
Our large FasterNet-L achieves impressive $83.5\%$ top-1 accuracy, on par with the emerging Swin-B, while having $36\%$ higher inference throughput on GPU.
- Score: 15.519170283930276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To design fast neural networks, many works have been focusing on reducing the
number of floating-point operations (FLOPs). We observe that such reduction in
FLOPs, however, does not necessarily lead to a similar level of reduction in
latency. This mainly stems from inefficiently low floating-point operations per
second (FLOPS). To achieve faster networks, we revisit popular operators and
demonstrate that such low FLOPS is mainly due to frequent memory access of the
operators, especially the depthwise convolution. We hence propose a novel
partial convolution (PConv) that extracts spatial features more efficiently, by
cutting down redundant computation and memory access simultaneously. Building
upon our PConv, we further propose FasterNet, a new family of neural networks,
which attains substantially higher running speed than others on a wide range of
devices, without compromising on accuracy for various vision tasks. For
example, on ImageNet-1k, our tiny FasterNet-T0 is $2.8\times$, $3.3\times$, and
$2.4\times$ faster than MobileViT-XXS on GPU, CPU, and ARM processors,
respectively, while being $2.9\%$ more accurate. Our large FasterNet-L achieves
impressive $83.5\%$ top-1 accuracy, on par with the emerging Swin-B, while
having $36\%$ higher inference throughput on GPU, as well as saving $37\%$
compute time on CPU. Code is available at
\url{https://github.com/JierunChen/FasterNet}.
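The core idea of PConv, as described above, is to apply a regular convolution to only a fraction of the input channels and pass the remaining channels through untouched. Below is a minimal PyTorch sketch of that idea; the partial ratio of 0.25, the choice of slicing the leading channels, and the class name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    def __init__(self, channels: int, partial_ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.conv_channels = int(channels * partial_ratio)   # channels that get convolved
        self.conv = nn.Conv2d(self.conv_channels, self.conv_channels,
                              kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Convolve only a leading slice of channels and pass the rest through untouched,
        # reducing both FLOPs and memory access compared with convolving every channel.
        x1, x2 = torch.split(x, [self.conv_channels, x.size(1) - self.conv_channels], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

x = torch.randn(1, 64, 56, 56)
print(PartialConv(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```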
Related papers
- End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$
Regularized Latency Surrogates [20.31383698391339]
Our algorithm is versatile and can be used with many popular compression methods including pruning, low-rank factorization, and quantization.
It is fast and runs in almost the same amount of time as training a single model.
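For reference, the $\frac{\ell_1}{\ell_2}$ ratio named in the title is a standard differentiable sparsity measure. The sketch below only illustrates that regularizer in PyTorch; it does not reproduce the paper's latency surrogate, which this summary does not detail.

```python
import torch

def l1_over_l2(w: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # ||w||_1 / ||w||_2 is small when w is sparse, so minimizing it promotes sparsity.
    return w.abs().sum() / (w.norm(p=2) + eps)

w = torch.randn(256, requires_grad=True)
reg = l1_over_l2(w)    # would be added to the task loss with a weighting coefficient
reg.backward()
```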
arXiv Detail & Related papers (2023-06-09T09:57:17Z)
- GhostNetV2: Enhance Cheap Operation with Long-Range Attention [59.65543143580889]
We propose a hardware-friendly attention mechanism (dubbed DFC attention) and then present a new GhostNetV2 architecture for mobile applications.
The proposed DFC attention is built on fully-connected layers; it not only executes quickly on common hardware but also captures long-range dependencies between pixels.
We further revisit the bottleneck in the previous GhostNet and propose to enhance the expanded features produced by cheap operations with DFC attention.
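A rough sketch of the decoupled fully-connected idea behind such an attention map is given below, assuming one linear layer that mixes positions along the width and one along the height, followed by a sigmoid gate. This is only an illustration of the concept, not the paper's exact DFC block.

```python
import torch
import torch.nn as nn

class DecoupledFCAttention(nn.Module):
    def __init__(self, h: int, w: int):
        super().__init__()
        self.fc_w = nn.Linear(w, w, bias=False)   # mixes positions along the width
        self.fc_h = nn.Linear(h, h, bias=False)   # mixes positions along the height

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, H, W)
        a = self.fc_w(x)                                       # aggregate along W
        a = self.fc_h(a.transpose(-1, -2)).transpose(-1, -2)   # aggregate along H
        return x * torch.sigmoid(a)                            # gate the features

x = torch.randn(1, 16, 8, 8)
print(DecoupledFCAttention(8, 8)(x).shape)   # torch.Size([1, 16, 8, 8])
```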
arXiv Detail & Related papers (2022-11-23T12:16:59Z)
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
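The sketch below shows the core lookup idea in PyTorch: a multiresolution grid whose corner features live in a fixed-size hash table of trainable vectors. It uses 2D inputs, nearest-corner lookup (no interpolation), and a common spatial-hashing formulation; all of these simplifications are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HashGrid2D(nn.Module):
    def __init__(self, levels: int = 4, table_size: int = 2**14, feat_dim: int = 2, base_res: int = 16):
        super().__init__()
        self.tables = nn.Parameter(1e-4 * torch.randn(levels, table_size, feat_dim))
        self.resolutions = [base_res * 2**l for l in range(levels)]
        self.table_size = table_size
        self.primes = torch.tensor([1, 2654435761])          # spatial-hashing primes

    def forward(self, xy: torch.Tensor) -> torch.Tensor:     # xy in [0, 1], shape (N, 2)
        feats = []
        for level, res in enumerate(self.resolutions):
            cell = (xy * res).long()                          # integer grid coordinates at this level
            idx = ((cell[:, 0] * self.primes[0]) ^ (cell[:, 1] * self.primes[1])) % self.table_size
            feats.append(self.tables[level][idx])             # gather trainable feature vectors
        return torch.cat(feats, dim=-1)                       # concatenated features feed a small MLP

enc = HashGrid2D()
print(enc(torch.rand(5, 2)).shape)   # torch.Size([5, 8])
```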
arXiv Detail & Related papers (2022-01-16T07:22:47Z)
- GhostShiftAddNet: More Features from Energy-Efficient Operations [1.2891210250935146]
Deep convolutional neural networks (CNNs) are computationally and memory intensive.
This paper proposes GhostShiftAddNet, where the motivation is to implement a hardware-efficient deep network.
We introduce a new bottleneck block, GhostSA, that converts all multiplications in the block to cheap operations.
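As a hedged illustration of what a "cheap operation" can mean in this setting, the snippet below approximates a multiplication by a sign flip plus a power-of-two scaling (conceptually a bit shift). The paper's actual GhostSA block design is not detailed in this summary.

```python
import torch

def shift_approx_mul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Replace y = w * x by sign(w) * 2**round(log2|w|) * x, i.e. a sign change plus a shift.
    sign = torch.sign(w)
    exponent = torch.round(torch.log2(w.abs().clamp(min=1e-8)))
    return x * sign * torch.exp2(exponent)

x, w = torch.randn(4), torch.randn(4)
print(w * x)                  # exact products
print(shift_approx_mul(x, w)) # shift-based approximation of the same products
```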
arXiv Detail & Related papers (2021-09-20T12:50:42Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
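For readers unfamiliar with the operator itself, the snippet below shows what a dilated 1D convolution computes at the framework level in PyTorch. It does not reflect the paper's AVX-512-optimized kernels, and the channel counts and dilation are illustrative.

```python
import torch
import torch.nn as nn

# Dilation d makes the kernel sample every d-th input position, enlarging the receptive field.
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3, dilation=4, padding=4)

x = torch.randn(1, 4, 1024)   # e.g. a 4-channel encoding of a length-1024 genomic sequence
y = conv(x)                   # receptive field spans (3 - 1) * 4 + 1 = 9 positions
print(y.shape)                # torch.Size([1, 8, 1024])
```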
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Efficient Neural Network Deployment for Microcontroller [0.0]
This paper explores and generalizes convolutional neural network deployment for microcontrollers.
The memory savings and performance will be compared with CMSIS-NN framework developed for ARM Cortex-M CPUs.
The final goal is a tool that consumes a PyTorch model with trained network weights and turns it into an optimized C/C++ inference engine for microcontrollers with low memory (kilobyte level) and limited computing capability.
arXiv Detail & Related papers (2020-07-02T19:21:05Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace the conventional ReLU with a Bounded ReLU and find that the accuracy decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding floating-point networks (FPNs), but have only 1/4 of the memory cost and run 2x faster on modern GPUs.
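A minimal sketch of a bounded ReLU is given below, assuming it means clipping activations to a fixed upper bound so that they quantize to a small integer range without outliers; the bound value of 6 is an illustrative choice, not necessarily the paper's.

```python
import torch

def bounded_relu(x: torch.Tensor, bound: float = 6.0) -> torch.Tensor:
    # Clipping the activation range keeps outliers from inflating the quantization step size.
    return torch.clamp(x, min=0.0, max=bound)

print(bounded_relu(torch.tensor([-2.0, 3.0, 9.0])))   # tensor([0., 3., 6.])
```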
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions [70.59851564292828]
Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks.
We propose a memory- and computation-efficient DNAS variant: DMaskingNAS.
This algorithm expands the search space by up to $10^{14}\times$ over conventional DNAS.
arXiv Detail & Related papers (2020-04-12T08:52:15Z)
- Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs [6.035819238203187]
We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance.
We also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM.
arXiv Detail & Related papers (2020-02-20T12:07:44Z)