Searching the Search Space of Vision Transformer
- URL: http://arxiv.org/abs/2111.14725v1
- Date: Mon, 29 Nov 2021 17:26:07 GMT
- Title: Searching the Search Space of Vision Transformer
- Authors: Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu,
Hongyang Chao, Haibin Ling
- Abstract summary: Vision Transformers have shown great visual representation power in a wide range of vision tasks such as recognition and detection.
We propose to use neural architecture search to automate this process by searching not only the architecture but also the search space.
We provide design guidelines for general vision transformers, with extensive analysis derived from the space-searching process.
- Score: 98.96601221383209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have shown great visual representation power in a wide range of vision tasks such as recognition and detection, attracting fast-growing efforts to manually design more effective architectures. In this paper, we propose to use neural architecture search to automate this process, searching not only the architecture but also the search space itself. The central idea is to gradually evolve the different search dimensions, guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers, with extensive analysis derived from the space-searching process, which could promote the understanding of vision transformers. Remarkably, the searched models, named S3 (short for Searching the Search Space), achieve superior performance to recently proposed models such as Swin, DeiT, and ViT when evaluated on ImageNet. The effectiveness of S3 is also demonstrated on object detection, semantic segmentation, and visual question answering, illustrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream.
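To make the central idea concrete, the sketch below illustrates E-T-Error-guided search-space evolution in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the equal weighting of expected and top-tier error inside `et_error`, the `eval_error` callback standing in for weight-sharing supernet evaluation, and the one-step re-centering of each search dimension are all hypothetical simplifications.

```python
import random
from statistics import mean

def et_error(errors, top_k=5):
    """E-T Error: combine the Expected error (mean over all sampled
    subnets) and the Top-tier error (mean over the best top_k subnets).
    The equal 0.5/0.5 weighting is an assumption for illustration."""
    expected = mean(errors)
    top_tier = mean(sorted(errors)[:top_k])
    return 0.5 * (expected + top_tier)

def sample_subnet(space):
    """Draw one architecture by picking a value for every search
    dimension (e.g. depth, embedding dim, MLP ratio, window size)."""
    return {dim: random.choice(choices) for dim, choices in space.items()}

def evolve_search_space(space, eval_error, samples=30, steps=3):
    """Gradually shift each search dimension toward the candidate value
    whose sampled subnets achieve the lowest E-T Error. `eval_error` is
    a hypothetical callback mapping an architecture dict to a validation
    error, standing in for evaluation with supernet-inherited weights."""
    for _ in range(steps):
        for dim, choices in list(space.items()):
            scores = {}
            for choice in choices:
                errs = []
                for _ in range(samples):
                    arch = sample_subnet(space)
                    arch[dim] = choice  # pin the dimension under study
                    errs.append(eval_error(arch))
                scores[choice] = et_error(errs)
            best = min(scores, key=scores.get)
            # Re-center the dimension's candidate range on the winner;
            # the +/- one-step neighborhood is an illustrative choice.
            i = choices.index(best)
            space[dim] = choices[max(0, i - 1): i + 2]
    return space

if __name__ == "__main__":
    # Toy search space; the "supernet" is mocked by a synthetic error
    # function that favors deeper, wider subnets, plus a little noise.
    space = {"depth": [8, 10, 12, 14], "embed_dim": [192, 256, 320, 384]}
    mock_error = lambda a: (1.0 / a["depth"] + 50.0 / a["embed_dim"]
                            + random.gauss(0.0, 0.01))
    print(evolve_search_space(space, mock_error))
```

In the actual method, each architecture would be scored by its validation error with weights inherited from the trained supernet, so that many candidates per dimension can be compared without training each from scratch.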
Related papers
- ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers [9.271932084757646]
3D occupancy representation quantizes the physical space into a grid map, capturing the entire scene without distinguishing between foreground and background.
We propose a learning-first view attention mechanism for effective multi-view feature aggregation.
We present FlowOcc3D, a benchmark built on top of existing high-quality datasets.
arXiv Detail & Related papers (2024-05-07T13:15:07Z) - VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding [47.58359136198136]
VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models.
It seamlessly integrates various SOTA vision models and automates the selection among them.
It identifies suitable 3D mesh creation algorithms corresponding to 2D depth map analysis and generates optimal results based on diverse multimodal inputs.
arXiv Detail & Related papers (2024-03-14T16:13:00Z) - Explainable Multi-Camera 3D Object Detection with Transformer-Based Saliency Maps [0.0]
Vision Transformers (ViTs) have achieved state-of-the-art results on various computer vision tasks, including 3D object detection.
Their end-to-end implementation makes ViTs less explainable, which can be a challenge when deploying them in safety-critical applications.
We propose a novel method for generating saliency maps for a DETR-like ViT with multiple camera inputs used for 3D object detection.
arXiv Detail & Related papers (2023-12-22T11:03:12Z) - Searching a High-Performance Feature Extractor for Text Recognition Network [92.12492627169108]
We design a domain-specific search space by exploring principles of good feature extractors.
Because the space is huge and complex in structure, no existing NAS algorithm can be applied directly.
We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - GLiT: Neural Architecture Search for Global and Local Image Transformer [114.8051035856023]
We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition.
Our method can find more discriminative and efficient transformer variants than the ResNet family and the baseline ViT for image classification.
arXiv Detail & Related papers (2021-07-07T00:48:09Z) - Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves 74.7% top-1 accuracy on ImageNet, 2.5% higher than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z) - Auto-MVCNN: Neural Architecture Search for Multi-view 3D Shape Recognition [16.13826056628379]
In 3D shape recognition, multi-view methods leverage the human perspective of analyzing 3D shapes and have achieved significant results.
We propose a neural architecture search method named Auto-MVCNN, designed specifically to optimize architectures for multi-view 3D shape recognition.
arXiv Detail & Related papers (2020-12-10T07:40:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.