SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
- URL: http://arxiv.org/abs/2304.03768v1
- Date: Fri, 7 Apr 2023 17:59:58 GMT
- Title: SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
- Authors: Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou
- Abstract summary: We present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner.
SparseFormer circumvents most dense operations in the image space and has much lower computational costs.
Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models.
- Score: 30.494412497158237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human visual recognition is a sparse process, where only a few salient visual
cues are attended to rather than traversing every detail uniformly. However,
most current vision networks follow a dense paradigm, processing every single
visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we
challenge this dense paradigm and present a new method, coined SparseFormer, to
imitate human sparse visual recognition in an end-to-end manner. SparseFormer
learns to represent images using a highly limited number of tokens (down to 49)
in the latent space with a sparse feature sampling procedure instead of
processing dense units in the original pixel space. SparseFormer therefore
circumvents most dense operations in the image space and has much lower
computational costs. Experiments on the ImageNet classification benchmark
show that SparseFormer achieves performance on par with canonical,
well-established models while offering a better accuracy-throughput tradeoff.
Moreover, our network design can be easily extended to video
classification, with promising performance at lower computational cost. We hope
that our work can provide an alternative way for visual modeling and inspire
further research on sparse neural architectures. The code will be publicly
available at https://github.com/showlab/sparseformer
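For intuition, here is a minimal sketch of the core idea described in the abstract: a small, fixed budget of latent tokens sparsely samples the image, and all subsequent reasoning happens in that latent token space rather than over dense pixels or patches. This is an illustrative approximation in PyTorch, not the authors' released architecture; the token count, dimensions, sampling scheme, and the `LatentTokenSampler` name are assumptions made for the sketch.

```python
# Conceptual sketch only: a handful of latent tokens bilinearly sample the image
# and a transformer layer reasons in the latent token space. Not the SparseFormer
# implementation; hyperparameters and the sampling scheme are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentTokenSampler(nn.Module):
    def __init__(self, num_tokens=49, dim=256, num_classes=1000):
        super().__init__()
        # Learnable latent tokens and their (x, y) sampling centers in [-1, 1].
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.centers = nn.Parameter(torch.empty(num_tokens, 2).uniform_(-1, 1))
        self.proj = nn.Linear(3, dim)     # lift sampled RGB values to the token dimension
        self.mixer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):            # images: (B, 3, H, W)
        B = images.size(0)
        # Bilinearly read one value per token instead of processing every pixel/patch.
        grid = self.centers.view(1, -1, 1, 2).expand(B, -1, -1, -1)   # (B, N, 1, 2)
        sampled = F.grid_sample(images, grid, align_corners=False)    # (B, 3, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)                 # (B, N, 3)
        x = self.tokens.unsqueeze(0) + self.proj(sampled)             # (B, N, dim)
        x = self.mixer(x)                 # further reasoning stays in the latent token space
        return self.head(x.mean(dim=1))   # pooled classification logits

logits = LatentTokenSampler()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

The actual method is more involved (the abstract refers to a sparse feature sampling procedure in latent space); the sketch only conveys the fixed token-budget idea that keeps the cost independent of dense image resolution.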
Related papers
- Bootstrapping SparseFormers from Vision Foundation Models [24.029898310518046]
We propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
The bootstrapped unimodal SparseFormer can reach 84.9% accuracy on IN-1K with only 49 tokens.
CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models.
arXiv Detail & Related papers (2023-12-04T16:04:41Z) - Learning Semantic Segmentation with Query Points Supervision on Aerial Images [57.09251327650334]
We present a weakly supervised learning algorithm for training semantic segmentation models.
Our proposed approach performs accurate semantic segmentation and improves efficiency by significantly reducing the cost and time required for manual annotation.
arXiv Detail & Related papers (2023-09-11T14:32:04Z) - HQ3DAvatar: High Quality Controllable 3D Head Avatar [65.70885416855782]
This paper presents a novel approach to building highly photorealistic digital head avatars.
Our method learns a canonical space via an implicit function parameterized by a neural network.
At test time, our method is driven by a monocular RGB video.
arXiv Detail & Related papers (2023-03-25T13:56:33Z) - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model with +4.9% and +4.3% absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z) - Glance and Focus Networks for Dynamic Visual Recognition [36.26856080976052]
We formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system.
The proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features.
It reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy.
arXiv Detail & Related papers (2022-01-09T14:00:56Z) - Sparse Spatial Transformers for Few-Shot Learning [6.271261279657655]
Learning from limited data is challenging because data scarcity leads to poor generalization of the trained model.
We propose a novel transformer-based neural network architecture called sparse spatial transformers.
Our method finds task-relevant features and suppresses task-irrelevant features.
arXiv Detail & Related papers (2021-09-27T10:36:32Z) - AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at only a fraction of their entries.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - Impression Space from Deep Template Network [72.86001835304185]
We show that a trained convolutional neural network has the capability to "remember" its input images.
We propose a framework to establish an Impression Space upon an off-the-shelf pretrained network.
arXiv Detail & Related papers (2020-07-10T15:29:33Z) - Neural Sparse Representation for Image Restoration [116.72107034624344]
Inspired by the robustness and efficiency of sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks.
Our method structurally enforces sparsity constraints upon hidden neurons.
Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks.
arXiv Detail & Related papers (2020-06-08T05:15:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.