Glance and Focus Networks for Dynamic Visual Recognition
- URL: http://arxiv.org/abs/2201.03014v1
- Date: Sun, 9 Jan 2022 14:00:56 GMT
- Title: Glance and Focus Networks for Dynamic Visual Recognition
- Authors: Gao Huang, Yulin Wang, Kangchen Lv, Haojun Jiang, Wenhui Huang,
Pengfei Qi, Shiji Song
- Abstract summary: We formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system.
The proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features.
It reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy.
- Score: 36.26856080976052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial redundancy widely exists in visual recognition tasks, i.e.,
discriminative features in an image or video frame usually correspond to only a
subset of pixels, while the remaining regions are irrelevant to the task at
hand. Therefore, static models which process all the pixels with an equal
amount of computation result in considerable redundancy in terms of time and
space consumption. In this paper, we formulate the image recognition problem as
a sequential coarse-to-fine feature learning process, mimicking the human
visual system. Specifically, the proposed Glance and Focus Network (GFNet)
first extracts a quick global representation of the input image at a low
resolution scale, and then strategically attends to a series of salient (small)
regions to learn finer features. The sequential process naturally facilitates
adaptive inference at test time, as it can be terminated once the model is
sufficiently confident about its prediction, avoiding further redundant
computation. It is worth noting that the problem of locating discriminant
regions in our model is formulated as a reinforcement learning task, thus
requiring no additional manual annotations other than classification labels.
GFNet is general and flexible as it is compatible with any off-the-shelf
backbone models (such as MobileNets, EfficientNets and TSM), which can be
conveniently deployed as the feature extractor. Extensive experiments on a
variety of image classification and video recognition tasks and with various
backbone models demonstrate the remarkable efficiency of our method. For
example, it reduces the average latency of the highly efficient MobileNet-V3 on
an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained
models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.
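The glance-then-focus loop with early termination is easy to picture in code. Below is a minimal sketch of the sequential inference idea under stated assumptions: it is not the authors' released implementation, and `glance_net`, `focus_net`, `policy`, `patch_size`, and the confidence `threshold` are illustrative stand-ins for the paper's components.

```python
import torch
import torch.nn.functional as F

def glance_and_focus_inference(image, glance_net, focus_net, policy,
                               patch_size=96, max_steps=5, threshold=0.9):
    """Sequential coarse-to-fine inference with early exit (batch size 1).

    Sketch only: glance_net/focus_net stand for any off-the-shelf backbone
    plus classifier head; policy stands for the RL-trained region proposer.
    """
    # Glance step: classify a cheap low-resolution view of the whole image.
    coarse = F.interpolate(image, size=(96, 96), mode='bilinear',
                           align_corners=False)
    logits, state = glance_net(coarse)           # global prediction + hidden state
    probs = logits.softmax(dim=-1)

    for _ in range(max_steps - 1):
        # Early exit: stop as soon as the model is sufficiently confident,
        # which is what gives the input-dependent (adaptive) latency.
        if probs.max().item() >= threshold:
            break
        # Focus step: the policy proposes the next salient region; in the
        # paper this is trained with reinforcement learning, so no extra
        # annotations beyond class labels are needed.
        y, x = policy(state)                     # top-left corner of the patch
        patch = image[..., y:y + patch_size, x:x + patch_size]
        logits, state = focus_net(patch, state)  # refine the prediction
        probs = logits.softmax(dim=-1)

    return probs.argmax(dim=-1), probs
```

Easy images exit after the glance alone, while hard ones pay for a few extra patches; that asymmetry is where the reported average-latency savings come from.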
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-Guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally aware, and robust video representations (a generic sketch of the Sinkhorn-style balanced assignment appears after this list).
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Filling Missing Values Matters for Range Image-Based Point Cloud Segmentation [12.62718910894575]
Point cloud segmentation (PCS) plays an essential role in robot perception and navigation tasks.
To efficiently understand large-scale outdoor point clouds, their range image representation is commonly adopted.
However, undesirable missing values in the range images damage the shapes and patterns of objects.
This makes it difficult for the models to learn coherent and complete geometric information from the objects.
arXiv Detail & Related papers (2024-05-16T15:13:42Z)
- Discriminative Feature Learning through Feature Distance Loss [0.0]
This work proposes a novel method that combines variant-rich base models to concentrate on different important image regions for classification.
Experiments on benchmark convolutional neural networks (VGG16, ResNet, AlexNet) and popular datasets (Cifar10, Cifar100, miniImageNet, NEU, BSD, TEX) show our method's effectiveness and generalization ability.
arXiv Detail & Related papers (2022-05-23T20:01:32Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices (see the sketch after this list).
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- DenserNet: Weakly Supervised Visual Localization Using Multi-scale Feature Aggregation [7.2531609092488445]
First, we develop a convolutional neural network architecture which aggregates feature maps at different semantic levels for image representations.
Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs.
Third, our method is computationally efficient as our architecture has shared features and parameters during computation.
arXiv Detail & Related papers (2020-12-04T02:16:47Z)
- Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification [46.885260723836865]
Deep convolutional neural networks (CNNs) generally perform better when fueled with high-resolution images.
Inspired by the fact that not all regions in an image are task-relevant, we propose a novel framework that performs efficient image classification.
Our framework is general and flexible as it is compatible with most of the state-of-the-art light-weighted CNNs.
arXiv Detail & Related papers (2020-10-11T17:55:06Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture which applies channel-wise attention across different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
- Evolving Losses for Unsupervised Video Representation Learning [91.2683362199263]
We present a new method to learn video representations from large-scale unlabeled video data.
The proposed unsupervised representation learning yields a single RGB network that outperforms previous methods.
arXiv Detail & Related papers (2020-02-26T16:56:07Z)
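Two of the entries above describe mechanisms concretely enough to sketch. For SIGMA, the step of distributing tube features evenly across a limited number of learnable clusters is characteristic of Sinkhorn-based balanced assignment; the following is a generic Sinkhorn-Knopp normalization in that spirit, an assumption about the general technique rather than the paper's code (`scores`, `eps`, and `n_iters` are illustrative):

```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Balanced soft assignment of N features to K clusters.

    scores: (N, K) similarities between features and learnable cluster
    prototypes. Returns an (N, K) assignment whose rows each sum to 1 and
    whose columns are used (approximately) uniformly, so no cluster collapses.
    """
    Q = torch.exp(scores / eps)                   # positive kernel
    Q = Q / Q.sum()                               # joint distribution over (N, K)
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K    # equalize cluster usage
        Q = Q / Q.sum(dim=1, keepdim=True) / N    # one unit of mass per feature
    return Q * N                                  # rows sum to 1 again
```

Alternating the row and column normalizations pushes the assignment toward uniform cluster usage, which is exactly the "distribute features evenly" behavior the summary mentions.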
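For AdaFocus, the summary's point about offline parallelism is clearer in code. A hedged sketch, with `global_cnn`, `policy_rnn`, `local_cnn`, and `classifier` as assumed stand-ins for the paper's modules:

```python
import torch
import torch.nn.functional as F

def adafocus_style_inference(video, global_cnn, policy_rnn, local_cnn,
                             classifier, patch_size=128):
    """Two-stage spatially adaptive video recognition; video: (T, C, H, W)."""
    T = video.shape[0]
    # Stage 1: a cheap global CNN scans every downsampled frame.
    coarse = F.interpolate(video, size=(96, 96), mode='bilinear',
                           align_corners=False)
    global_feats = global_cnn(coarse)                  # (T, D)

    # Stage 2: a recurrent policy (RL-trained in the paper) sequentially
    # picks one task-relevant patch location per frame.
    locations, h = [], None
    for t in range(T):
        (y, x), h = policy_rnn(global_feats[t], h)
        locations.append((y, x))

    # Stage 3 (offline): once all locations are known, the expensive
    # network processes the patches as a single batched forward pass
    # instead of a sequential loop, which is the GPU-friendly part.
    patches = torch.stack([video[t, :, y:y + patch_size, x:x + patch_size]
                           for t, (y, x) in enumerate(locations)])
    local_feats = local_cnn(patches)                   # (T, D')
    fused = torch.cat([global_feats, local_feats], dim=1).mean(dim=0)
    return classifier(fused)
```

The design point of stage 3 is that fixing the patch sequence first decouples region selection from feature extraction, so the heavy computation parallelizes cleanly.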